## Introduction

Tajima’s D is a statistic that compares the average number of pairwise differences with the number of segregating sites. It’s an important statistic that is widely used in population genetics. However, it must be carefully analyzed because population demography changes how Tajima’s D can be interpreted. In this post, we will look at why Tajima’s D works and how demography (especially population bottlenecks) can cloud signals of Tajima’s D.

** Figure 1.** A population bottleneck with a new rare allele (orange) arising after the bottleneck.

First, let’s look at the two major components of Tajima’s D. is the number of pairwise differences (Tajima’s estimator or ) and is given by:

where is the number of differences between two sequences, i and j. is given by the number of sequences to compare. is the number of segregating sites (Watterson’s estimator or S) given by:

where S is the number of sites that segregate in the sample (that is, the number of sites are variable). Now that we have these definitions, we can take a look at their differences. The expectation is that both estimators will be equal to , which is the population scaled mutation rate. However, this is not the case! Because of the way it is calculated, will underestimate the number of mutations that are rare in the population. Because of this, comparing the two values of gives us a test statistic, Tajima’s D:

### Rare variants contribute little to

In the examples below, sequences are represented as dashes and asterisks. The sequences are aligned and a dash is a site that is the same between sequences and the asterisks represents a mutation.

Given some sequences, what are the values for and ?

**Example 1**

```
---*---*------
-------*---*--
-------*------
-----------*--
```

and

So in this case, gives us a similar estimate compared to (thanks to Marie-Julie Favé and her students for pointing out an error here). However, this will change once we introduce more rare mutations. The numerator in is the number of differences between each sequence. For example, if you compare the first 2 sequences, there are 2 differences. The first and third have 1 difference, etc.

**Example 2**

```
-*------------
----*---------
-------*------
-----------*--
```

and

Here, because each mutation is a rare variant, the estimate of is lower in than in . This effect is not very pronounced in this example because n is so small, but if we increase n to 100 and make every mutation a rare one (i.e. at a frequency of 0.01), we can see the effect more clearly.

**Example 3**

```
-*--------------------------------------------------------------------------------------------------
--*-------------------------------------------------------------------------------------------------
---*------------------------------------------------------------------------------------------------
----*-----------------------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------------------------*-
```

and

Woah that’s a huge difference! Because of this difference, we can compare the two estimators and learn something about what kind of variants we have in the population.

### Tajima’s D

Tajima’s D is the comparison between the average number of pairwise differences and the number of segregating sites in a sample. We expect positive selection (or selective sweeps) to give us a negative Tajima’s D in a population that doesn’t have any demographic changes going on (population expansion/contraction, migration, etc). This is because will be greater than . Why? Because after a selective sweep, most of the haplotypes in a population will be the same. Therefore, when mutations occur they will be rare. When you have a lot of rare mutations, underestimates compared to and you get a negative Tajima’s D.

In the case of balancing selection, alleles are kept at intermediate frequencies. This produces a positive Tajima’s D because there will be more pairwise differences than segregating sites.

### The effect of bottlenecks

Under demographic scenarios, Tajima’s D can exhibit signals that look like selection even under neutral simulations. Above, you can see how Tajima’s D changes before, during, and after a bottleneck. The effect of population expansion (recovery) on Tajima’s D is staggering. As soon as the population expands, Tajima’s D drops to about -0.45. It continues to drop and begins to recover, but stays negative. This effect is common even when the bottleneck size changes. Below, you can see what happens to Tajima’s D after a bottleneck when the recovery from the bottleneck event varies (10% of the population survives in the bottleneck):

Why does this happen? On average, a bottleneck event will remove rare alleles due to sampling and intermediate alleles will be rare. On average, there will be more rare variants than intermediate variants and Tajima’s D will be negative. This makes it hard to distinguish signals of selection from Tajima’s D when a population has gone through a bottleneck. A neutral population with no selection will look the same as a population that has recently undergone a selective sweep. Below is a stronger bottleneck (1% of population survives):

and here is a weaker bottleneck (20% of population survives):

In all cases, Tajima’s D drops to negative values after a bottleneck. The 1% bottleneck has a much different shape than either of the other bottlenecks. This is because when such an extreme bottleneck occurs, every allele will be rare and Tajima’s D will start out negative. It recovers from the bottleneck in a similar way towards 0. However, in all of these simulations, Tajima’s D never converges to 0 (note that these are neutral simulations). This is because when we have a finite number of samples, the resting point of Tajima’s D is not 0, but a small negative number close to 0.

As we can see from these simulations, Tajima’s D is a complicated statistic and care must be taken when using it to analyze populations. For more on the effect of bottlenecks on populations, check out this presentation: http://www.slideshare.net/jrossibarra/bottlenecks-some-ramblings-and-a-bit-of-data-from-maize

Acknowledgements: Thanks to Jeff Ross-Ibarra for discussions and help with the simulations.

R code for simulations:

options(scipen=999)

library(ggplot2)

sims=1000000

bneck_end=runif(sims,0,2) # end comes first backward in time

bneck_start=bneck_end+0.005 # start comes 2nd so is bigger number

tbsfile<-cbind(bneck_end,bneck_start)

write.table(file="tbsfile",tbsfile,col.names=F,row.names=F,quote=F)

#we do with recombination, with theta=10 or N~80,000

#bneck then is 0.005*4*80,000 = 1,600 generations

f=pipe(paste("~/Downloads/msdir/ms 10 ", sims, " -t 10 -r 10 1000 -eN tbs 0.1 -eN tbs 1 < tbsfile | ~/Downloads/msdir/sample_stats | cut -f 6",sep=""))

taj_d=scan(f)

qplot(bneck_end,taj_d,geom="smooth", xlab="Bottleneck end (4N generations)", ylab="Tajima's D", main="Effect of recovery time on Tajima's D")