Interpreting Tajima’s D


Tajima’s D is a statistic that compares the average number of pairwise differences with the number of segregating sites. It’s an important statistic that is widely used in population genetics. However, it must be carefully analyzed because population demography changes how Tajima’s D can be interpreted. In this post, we will look at why Tajima’s D works and how demography (especially population bottlenecks) can cloud signals of Tajima’s D.

Figure 1. A population bottleneck with a new rare allele (orange) arising after the bottleneck.

First, let’s look at the two major components of Tajima’s D. \hat{\theta}_T is the number of pairwise differences (Tajima’s estimator or \pi) and is given by:

\hat{\theta}_T=\frac{\sum\limits_{i<j} d_{ij}}{n(n-1)/2}

where d_{ij} is the number of differences between two sequences, i and j. n is given by the number of sequences to compare. \hat{\theta}_W is the number of segregating sites (Watterson’s estimator or S) given by:

\hat{\theta}_W=\frac{S}{\sum\limits_{i=1}^{n-1} 1/i}

where S is the number of sites that segregate in the sample (that is, the number of sites are variable). Now that we have these definitions, we can take a look at their differences. The expectation is that both estimators will be equal to \theta, which is the population scaled mutation rate. However, this is not the case! Because of the way it is calculated, \hat{\theta}_T will underestimate the number of mutations that are rare in the population. Because of this, comparing the two values of \theta gives us a test statistic, Tajima’s D:

D=\frac{\hat{\theta}_T - \hat{\theta}_W}{\sqrt{\hat{V}(\hat{\theta}_T - \hat{\theta}_W)}}

Rare variants contribute little to \hat{\theta}_T

In the examples below, sequences are represented as dashes and asterisks. The sequences are aligned and a dash is a site that is the same between sequences and the asterisks represents a mutation.

Given some sequences, what are the values for \hat{\theta}_T and \hat{\theta}_W?

Example 1


\hat{\theta}_T=\frac{2+1+3+1+1+2}{6}=1.67 and \hat{\theta}_W=\frac{3}{1+\frac{1}{2}+\frac{1}{3}}=1.63

So in this case, \hat{\theta}_T gives us a similar estimate compared to \hat{\theta}_W (thanks to Marie-Julie Favé and her students for pointing out an error here). However, this will change once we introduce more rare mutations. The numerator in \hat{\theta}_T is the number of differences between each sequence. For example, if you compare the first 2 sequences, there are 2 differences. The first and third have 1 difference, etc.

Example 2


\hat{\theta}_T=\frac{2+2+2+2+2+2}{6}=2 and \hat{\theta}_W=\frac{4}{1+\frac{1}{2}+\frac{1}{3}}=2.2
Here, because each mutation is a rare variant, the estimate of \theta is lower in \hat{\theta}_T than in \hat{\theta}_W. This effect is not very pronounced in this example because n is so small, but if we increase n to 100 and make every mutation a rare one (i.e. at a frequency of 0.01), we can see the effect more clearly.

Example 3


\hat{\theta}_T=\frac{2(99.5)}{99.5}=2 and \hat{\theta}_W=\frac{100}{1+\frac{1}{2}+...+\frac{1}{99}}=19.31
Woah that’s a huge difference! Because of this difference, we can compare the two estimators and learn something about what kind of variants we have in the population.

Tajima’s D

Tajima’s D is the comparison between the average number of pairwise differences and the number of segregating sites in a sample. We expect positive selection (or selective sweeps) to give us a negative Tajima’s D in a population that doesn’t have any demographic changes going on (population expansion/contraction, migration, etc). This is because \hat{\theta}_W will be greater than \hat{\theta}_T. Why? Because after a selective sweep, most of the haplotypes in a population will be the same. Therefore, when mutations occur they will be rare. When you have a lot of rare mutations, \hat{\theta}_T underestimates \theta compared to \hat{\theta}_W and you get a negative Tajima’s D.

In the case of balancing selection, alleles are kept at intermediate frequencies. This produces a positive Tajima’s D because there will be more pairwise differences than segregating sites.

The effect of bottlenecks

Under demographic scenarios, Tajima’s D can exhibit signals that look like selection even under neutral simulations. Above, you can see how Tajima’s D changes before, during, and after a bottleneck. The effect of population expansion (recovery) on Tajima’s D is staggering. As soon as the population expands, Tajima’s D drops to about -0.45. It continues to drop and begins to recover, but stays negative. This effect is common even when the bottleneck size changes. Below, you can see what happens to Tajima’s D after a bottleneck when the recovery from the bottleneck event varies (10% of the population survives in the bottleneck):

Tajima's D is negative after a bottleneck

Why does this happen? On average, a bottleneck event will remove rare alleles due to sampling and intermediate alleles will be rare. On average, there will be more rare variants than intermediate variants and Tajima’s D will be negative. This makes it hard to distinguish signals of selection from Tajima’s D when a population has gone through a bottleneck. A neutral population with no selection will look the same as a population that has recently undergone a selective sweep. Below is a stronger bottleneck (1% of population survives):

1% bottleneck

and here is a weaker bottleneck (20% of population survives):

20% bottleneck

In all cases, Tajima’s D drops to negative values after a bottleneck. The 1% bottleneck has a much different shape than either of the other bottlenecks. This is because when such an extreme bottleneck occurs, every allele will be rare and Tajima’s D will start out negative. It recovers from the bottleneck in a similar way towards 0. However, in all of these simulations, Tajima’s D never converges to 0 (note that these are neutral simulations). This is because when we have a finite number of samples, the resting point of Tajima’s D is not 0, but a small negative number close to 0.

As we can see from these simulations, Tajima’s D is a complicated statistic and care must be taken when using it to analyze populations. For more on the effect of bottlenecks on populations, check out this presentation:

Acknowledgements: Thanks to Jeff Ross-Ibarra for discussions and help with the simulations.

R code for simulations:

bneck_end=runif(sims,0,2) # end comes first backward in time
bneck_start=bneck_end+0.005 # start comes 2nd so is bigger number
#we do with recombination, with theta=10 or N~80,000
#bneck then is 0.005480,000 = 1,600 generations
f=pipe(paste("~/Downloads/msdir/ms 10 ", sims, " -t 10 -r 10 1000 -eN tbs 0.1 -eN tbs 1 < tbsfile | ~/Downloads/msdir/sample_stats | cut -f 6",sep=""))
qplot(bneck_end,taj_d,geom="smooth", xlab="Bottleneck end (4N generations)", ylab="Tajima's D", main="Effect of recovery time on Tajima's D")


Understanding Markov Chains

Before you start, have you read the conventions page?


A Markov chain is a mathematical model for random processes where time is taken into consideration. For each step of time (each tick of a clock), there is a probability that some event will happen. This probability depends on what the previous event was.

For example, let’s say we built a robot that has a very simple set of instructions. It can only move 1 meter in 1 of 2 directions (forward or backwards) every second and the direction it chooses depends on the last direction it chose. If the robot moved up in the last second, it has a 75% chance of moving down in the next second (and vice-versa). If you observe and record the robot’s movement at each time step, you might get something like this:

t | movement
0 | forward
1 | backwards
2 | forward
3 | backwards
4 | backwards


This could go on for any number of time steps, generating a sequence of events. This sequence is called a Markov chain.

A better definition

Mathematically, when a random process, X = (X_0, X_1, X_2, ..., X_n), follows the following property, it is a Markov chain:

P[X_{n+1} = s | X_n = s_n, X_{n-1} = s_{n-1}, ... , X_0 = s_0] = P[X_{n+1} = s | X_n = s_n]

where S is a set containing all the possible states (the state space). Each member of X is a random variable. The above equation is stating, simply, that the future value of X (X_{n+1}) is dependent on the present values of X.

We can also create a diagram of the possible states and probabilities of transition in something called a state transition diagram:

This can also be described as a matrix. The same Markov chain is described below:

This matrix is called the transition matrix. Each value in the matrix must be greater than zero, and each row much sum to 1. Or in fancy latex: P_{ij} \geq 0, \sum_{j} P_{ij} = 1 for all i.

Storing the probabilities in a matrix allows us perform linear algebra operations on these Markov chains, which I will talk about in another blog post.


Stanford Markov Chain tutorial:

Youtube Markov Chain video:

Understanding CRISPR/Cas

Before you start, have you read the conventions page?

With the recent New York Times article on the subject, I thought it would be a good idea to work through how CRISPR/Cas works and explain its significance in a little more detail.

What is CRISPR/Cas?

CRISPR/Cas is an immune system found in bacteria (and archaea) that stops viruses from replicating themselves.

When viruses invade cells, their goal is to replicate their DNA and create new viruses. They invade cells because they don’t have the cellular machinery (ribosomes, tRNAs, etc.) to do so themselves. To replicate themselves, they inject their DNA into a cell. The cellular machinery that replicates DNA is kind of dumb, so it will replicate any DNA that it finds. The virus is then copied many times and leaves the host cell so it can invade other cells.

In order to stop this, some bacteria have a system that can recognize viral DNA and create proteins that chop it up into pieces, stopping it from creating new viruses. This system is the CRISPR/Cas system.

How does it work?


CRISPR/Cas is a simple and, in my opinion, elegant immune system. Basically, the bacterium creates an RNA that recognizes invading virus DNA and binds to it. Then, a generalized protein finds the RNA and cuts the virus DNA. Thus, CRISPR/Cas is composed of two main parts, a specific module and a general module. CRISPR is the part that corresponds to specificity and Cas (as you may have guessed) is a more general set of genes. The main advantage to this modularity is adaptability. By separating the part that destroys DNA with the part the recognizes DNA, the bacteria can make one protein that cuts and one RNA that guides, rather than having a specialized protein that both binds and cuts DNA for every virus. When a new virus enters the cell, the bacteria can add the DNA to the CRISPR portion and drop it right into the existing immune system.

CRISPR stands for “Clustered Regularly Interspaced Short Palindromic Repeats”, which gives us an idea of the structure of the DNA. In the diagram above, there is a region labeled “repeat array.” This is made up of spacer sequences separated by repeats (the terminology is a little misleading). These spacers contain the recognition sequences the Cas proteins use to find specific DNA sequences.  These spacer sequences are actually viral DNA that have been integrated into the bacterial genome. When the spacer sequence is transcribed to RNA, it can line up with the actual invading virus DNA and bind to it. Then the Cas protein can cleave the virus DNA and inactivate it.

The other portion of the repeat array is the repeats, which are palindromic. This means that once the DNA is transcribed to RNA, the RNA can base pair with itself and form the loop structure seen in the diagram above. This structure interacts with the Cas protein and forms a larger structure, which is labeled as the protein-RNA complex above.

Cas stands for CRISPR associated sequence and is named that way because when researchers discovered CRISPR, they found this group of genes near it. Cas has a few genes that are known to cut DNA. Their main function is to find the CRISPR RNA and cut the DNA that it’s bound to. Additionally, some Cas genes are known to add in viral DNA to the CRISPR spacers, allowing more virus DNA sequences to be recognized. The way this happens isn’t well-known, however.

Thus, when these genes are transcribed and translated, 3 parts come together. First you have the Cas protein, which cleaves DNA. Next you have the palindromic repeat region of the RNA that binds to the Cas protein and finally, there is the spacer region which finds the specific virus DNA and binds to it.

How are people using it?

One of the exciting things about CRISPR/Cas is the way that people are using it. As the NYTimes piece above explains, it’s possible to “edit” genomes. One way to do this is to modify the Cas9 gene, the gene that cuts the DNA, to be catalytically inactive. This means that instead of cutting like it’s supposed to, it just binds to the DNA and stays there without cutting it. Then, the DNA it bound to cannot be transcribed to RNA because there is a large protein in the way of RNA polymerase. This is the method presented by Qi et. al. in 2013, where they use this method for gene repression.

Another way to inactivate genes is to use the CRISPR/Cas system the way it was intended to — to cut specific DNA sequences. The difference here is that we can engineer the system to bind to a sequence of our choosing rather than a sequence that is already present in the system. For example, if there is a gene we want to turn off, we can engineer the system to recognize a portion of it (20-72 base pairs). When the small guide RNA locates that sequence, Cas9 will cut a few base pairs of the DNA. Now, there are preexisting mechanisms in the cell that can repair DNA breaks like this known as non-homologous end joining. However, this mechanism doesn’t know what the DNA should look like, all it does it stick the ends of the DNA back together. Because Cas9 cut a few base pairs out, when the DNA is glued back together, it is missing some information. When the DNA is transcribed and translated into a protein, the protein will now contain different information and will be nonfunctional. This is what Niu et. al. 2014 did in the paper linked to by the NYTimes piece.

Gene inactivation/repression is useful when there is a gene producing harmful transcripts and that gene is not necessary for the organism’s survival.

Genome editing is another use of CRISPR/Cas. There is a mechanism in cells called homology directed repair that uses DNA sequence that the cell knows is “reputable” and uses that to copy the information into the DNA break. Thus, it’s possible to use CRISPR/Cas to create a targeted break in the DNA and introduce a gene to the cell and have the cell use the introduced gene as a template for homology directed repair.

How is it different from similar methods?

Targeted cleavage of DNA is not something new to science. Several methods allow for the cleavage of DNA in very specific regions such as restriction endonucleases, zinc finger nucleases, and transcription activator-like effector nucleases. As you may have guessed from the names, a nuclease is a protein the cleaves DNA.

A restriction enzyme is a protein that recognizes a very short sequence of DNA (4-8 base pairs) and cleaves the DNA at a certain spot in that recognition sequence [wiki]. These are very commonly used in biology and their activity has been well characterized.

Zinc finger nucleases are engineered proteins that have two parts [wiki]. One part cleaves DNA, just like Cas. The other part is a DNA binding protein called a zinc finger. Just like the spacer region of the CRISPR/Cas sequence, it can bind to specific sequences. The difference here is that the zinc finger is a protein, while the spacer region is an RNA.

Transcription activator-like effector nucleases are also engineered proteins with two parts [wiki]. Like the zinc finger nuclease and the Cas protein, one domain cleaves DNA and one domain binds to DNA. The DNA binding domain in TALENs are called transcription activator-like effectors, and they have very specific nucleotide recognition sequences.

Both zinc finger nucleases and transcription activator-like effector nucleases can be engineered to recognize specific sequences, just like CRISPR/Cas. However, both methods are susceptible to off target cleavage if the target site is not unique in the genetic sequence of interest. One of the reasons CRISPR/Cas is exciting is because it has a longer recognition sequence so there is less of a chance of off target cleavage.


Hopefully this post taught you something more about the CRISPR/Cas system and how it’s being used (it certainly taught me). If you find errors or things that should be clarified, please don’t hesitate to tell me!


  1. Lei S. Qi, Matthew H. Larson, Luke A. Gilbert, Jennifer A. Doudna, Jonathan S. Weissman, Adam P. Arkin, Wendell A. Lim. “Repurposing CRISPR as an RNA-Guided Platform for Sequence-Specific Control of Gene Expression.” Cell – 28 February 2013 (Vol. 152, Issue 5, pp. 1173-1183). DOI.

  2. Fedor V. Karginov, Gregory J. Hannon. “The CRISPR System: Small RNA-Guided Defense in Bacteria and Archaea.” Molecular Cell – 15 January 2010 (Vol. 37, Issue 1, pp. 7-19). DOI.