Interpreting Tajima’s D


Tajima’s D is a statistic that compares the average number of pairwise differences with the number of segregating sites. It’s an important statistic that is widely used in population genetics. However, it must be carefully analyzed because population demography changes how Tajima’s D can be interpreted. In this post, we will look at why Tajima’s D works and how demography (especially population bottlenecks) can cloud signals of Tajima’s D.

Figure 1. A population bottleneck with a new rare allele (orange) arising after the bottleneck.

First, let’s look at the two major components of Tajima’s D. \hat{\theta}_T is the number of pairwise differences (Tajima’s estimator or \pi) and is given by:

\hat{\theta}_T=\frac{\sum\limits_{i<j} d_{ij}}{n(n-1)/2}

where d_{ij} is the number of differences between two sequences, i and j. n is given by the number of sequences to compare. \hat{\theta}_W is the number of segregating sites (Watterson’s estimator or S) given by:

\hat{\theta}_W=\frac{S}{\sum\limits_{i=1}^{n-1} 1/i}

where S is the number of sites that segregate in the sample (that is, the number of sites are variable). Now that we have these definitions, we can take a look at their differences. The expectation is that both estimators will be equal to \theta, which is the population scaled mutation rate. However, this is not the case! Because of the way it is calculated, \hat{\theta}_T will underestimate the number of mutations that are rare in the population. Because of this, comparing the two values of \theta gives us a test statistic, Tajima’s D:

D=\frac{\hat{\theta}_T - \hat{\theta}_W}{\sqrt{\hat{V}(\hat{\theta}_T - \hat{\theta}_W)}}

Rare variants contribute little to \hat{\theta}_T

In the examples below, sequences are represented as dashes and asterisks. The sequences are aligned and a dash is a site that is the same between sequences and the asterisks represents a mutation.

Given some sequences, what are the values for \hat{\theta}_T and \hat{\theta}_W?

Example 1


\hat{\theta}_T=\frac{2+1+3+1+1+2}{6}=1.67 and \hat{\theta}_W=\frac{3}{1+\frac{1}{2}+\frac{1}{3}}=1.63

So in this case, \hat{\theta}_T gives us a similar estimate compared to \hat{\theta}_W (thanks to Marie-Julie Favé and her students for pointing out an error here). However, this will change once we introduce more rare mutations. The numerator in \hat{\theta}_T is the number of differences between each sequence. For example, if you compare the first 2 sequences, there are 2 differences. The first and third have 1 difference, etc.

Example 2


\hat{\theta}_T=\frac{2+2+2+2+2+2}{6}=2 and \hat{\theta}_W=\frac{4}{1+\frac{1}{2}+\frac{1}{3}}=2.2
Here, because each mutation is a rare variant, the estimate of \theta is lower in \hat{\theta}_T than in \hat{\theta}_W. This effect is not very pronounced in this example because n is so small, but if we increase n to 100 and make every mutation a rare one (i.e. at a frequency of 0.01), we can see the effect more clearly.

Example 3


\hat{\theta}_T=\frac{2(99.5)}{99.5}=2 and \hat{\theta}_W=\frac{100}{1+\frac{1}{2}+...+\frac{1}{99}}=19.31
Woah that’s a huge difference! Because of this difference, we can compare the two estimators and learn something about what kind of variants we have in the population.

Tajima’s D

Tajima’s D is the comparison between the average number of pairwise differences and the number of segregating sites in a sample. We expect positive selection (or selective sweeps) to give us a negative Tajima’s D in a population that doesn’t have any demographic changes going on (population expansion/contraction, migration, etc). This is because \hat{\theta}_W will be greater than \hat{\theta}_T. Why? Because after a selective sweep, most of the haplotypes in a population will be the same. Therefore, when mutations occur they will be rare. When you have a lot of rare mutations, \hat{\theta}_T underestimates \theta compared to \hat{\theta}_W and you get a negative Tajima’s D.

In the case of balancing selection, alleles are kept at intermediate frequencies. This produces a positive Tajima’s D because there will be more pairwise differences than segregating sites.

The effect of bottlenecks

Under demographic scenarios, Tajima’s D can exhibit signals that look like selection even under neutral simulations. Above, you can see how Tajima’s D changes before, during, and after a bottleneck. The effect of population expansion (recovery) on Tajima’s D is staggering. As soon as the population expands, Tajima’s D drops to about -0.45. It continues to drop and begins to recover, but stays negative. This effect is common even when the bottleneck size changes. Below, you can see what happens to Tajima’s D after a bottleneck when the recovery from the bottleneck event varies (10% of the population survives in the bottleneck):

Tajima's D is negative after a bottleneck

Why does this happen? On average, a bottleneck event will remove rare alleles due to sampling and intermediate alleles will be rare. On average, there will be more rare variants than intermediate variants and Tajima’s D will be negative. This makes it hard to distinguish signals of selection from Tajima’s D when a population has gone through a bottleneck. A neutral population with no selection will look the same as a population that has recently undergone a selective sweep. Below is a stronger bottleneck (1% of population survives):

1% bottleneck

and here is a weaker bottleneck (20% of population survives):

20% bottleneck

In all cases, Tajima’s D drops to negative values after a bottleneck. The 1% bottleneck has a much different shape than either of the other bottlenecks. This is because when such an extreme bottleneck occurs, every allele will be rare and Tajima’s D will start out negative. It recovers from the bottleneck in a similar way towards 0. However, in all of these simulations, Tajima’s D never converges to 0 (note that these are neutral simulations). This is because when we have a finite number of samples, the resting point of Tajima’s D is not 0, but a small negative number close to 0.

As we can see from these simulations, Tajima’s D is a complicated statistic and care must be taken when using it to analyze populations. For more on the effect of bottlenecks on populations, check out this presentation:

Acknowledgements: Thanks to Jeff Ross-Ibarra for discussions and help with the simulations.

R code for simulations:

bneck_end=runif(sims,0,2) # end comes first backward in time
bneck_start=bneck_end+0.005 # start comes 2nd so is bigger number
#we do with recombination, with theta=10 or N~80,000
#bneck then is 0.005480,000 = 1,600 generations
f=pipe(paste("~/Downloads/msdir/ms 10 ", sims, " -t 10 -r 10 1000 -eN tbs 0.1 -eN tbs 1 < tbsfile | ~/Downloads/msdir/sample_stats | cut -f 6",sep=""))
qplot(bneck_end,taj_d,geom="smooth", xlab="Bottleneck end (4N generations)", ylab="Tajima's D", main="Effect of recovery time on Tajima's D")


12 thoughts on “Interpreting Tajima’s D

  1. Dopey

    I’ve always considered selective sweeps a type of (effective) population bottleneck. Will there be differences, not at the global scale (considering the whole genome), but at the local scale (plotting D using a sliding window along the chromosome) in randomly mating populations that undergo a moderate amount of recombination? What will a selective sweep look like if there is quite a bit of inbreeding going on?

  2. Lili


    Thank you very much Arun for this clear explanation of theta estimators and Tajima’s D! I used your examples (with credits) in a pop gen graduate class that I am teaching, and my students and I were wondering how you got to a value of Theta (T) in example 1 of 3.17. Your state that ThetaT= (2+1+3+1+1+2)/6, which totally makes sense, but this equals to 10/6=1.67, not 3.17. Are we missing something?


    1. Hello! Thanks very much for your comment. I’m glad this post is able to help you all out! You’re absolutely correct — I’ve updated the blog post and credited you (let me know if I got the name correct). Don’t know what I was thinking, thanks for the correction!

  3. DS


    Thank you so much for your post, this is very helpful.
    I also found a miscalculation in the first example, θ(W) should be 18/11 = 1.633…,right?


  4. ArtemP

    Hi, Arun! Great post. I’ve got a question.

    “On average, a bottleneck event will remove rare alleles due to sampling and intermediate alleles will be rare.” Do you mean here that the intermediate alleles that were intermediate before a bottleneck will become rare (during or recovery phase)? Do you have a sketch of the demographic model that you simulated? Cheers.

    1. Hi Artem,

      Thanks! Yes, that’s what I mean. I don’t have a figure handy but the details of the simulation are in the R code at the bottom. Specifically, the f=pipe(…) line calls ms with the specified demography. There is a bottleneck that lasts 1600 generations and is a 10 fold reduction in population size. Hope that helps!

      1. ArtemP

        Thanks for replying!

        Is it a new finding?

        Hmm… I always thought it’s kind of the opposite – D goes up during the bottleneck, which removes the low-freq alleles, and then it goes negative during the recovery phase, when the rare alleles return (de novo mutations?), and then it reaches an equilibrium with time. It’s also on Jeff’s ppt that you link to (slide 9).

  5. ArtemP

    ahh, now I got, it depends on the intensity if the bottleneck whether D immediately goes down or up and then down as you also describe. sorry for the confusion. :)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s