Interpreting Tajima’s D

Introduction

Tajima’s D is a statistic that compares the average number of pairwise differences with the number of segregating sites. It’s an important statistic that is widely used in population genetics. However, it must be carefully analyzed because population demography changes how Tajima’s D can be interpreted. In this post, we will look at why Tajima’s D works and how demography (especially population bottlenecks) can cloud signals of Tajima’s D.
cow-bottleneck

Figure 1. A population bottleneck with a new rare allele (orange) arising after the bottleneck.

First, let’s look at the two major components of Tajima’s D. \hat{\theta}_T is the number of pairwise differences (Tajima’s estimator or \pi) and is given by:

\hat{\theta}_T=\frac{\sum\limits_{i<j} d_{ij}}{n(n-1)/2}

where d_{ij} is the number of differences between two sequences, i and j. n is given by the number of sequences to compare. \hat{\theta}_W is the number of segregating sites (Watterson’s estimator or S) given by:

\hat{\theta}_W=\frac{S}{\sum\limits_{i=1}^{n-1} 1/i}

where S is the number of sites that segregate in the sample (that is, the number of sites are variable). Now that we have these definitions, we can take a look at their differences. The expectation is that both estimators will be equal to \theta, which is the population scaled mutation rate. However, this is not the case! Because of the way it is calculated, \hat{\theta}_T will underestimate the number of mutations that are rare in the population. Because of this, comparing the two values of \theta gives us a test statistic, Tajima’s D:

D=\frac{\hat{\theta}_T - \hat{\theta}_W}{\sqrt{\hat{V}(\hat{\theta}_T - \hat{\theta}_W)}}

Rare variants contribute little to \hat{\theta}_T

In the examples below, sequences are represented as dashes and asterisks. The sequences are aligned and a dash is a site that is the same between sequences and the asterisks represents a mutation.

Given some sequences, what are the values for \hat{\theta}_T and \hat{\theta}_W?

Example 1

---*---*------
-------*---*--
-------*------
-----------*--

\hat{\theta}_T=\frac{2+1+3+1+1+2}{6}=1.67 and \hat{\theta}_W=\frac{3}{1+\frac{1}{2}+\frac{1}{3}}=1.63

So in this case, \hat{\theta}_T gives us a similar estimate compared to \hat{\theta}_W (thanks to Marie-Julie Favé and her students for pointing out an error here). However, this will change once we introduce more rare mutations. The numerator in \hat{\theta}_T is the number of differences between each sequence. For example, if you compare the first 2 sequences, there are 2 differences. The first and third have 1 difference, etc.

Example 2

-*------------
----*---------
-------*------
-----------*--

\hat{\theta}_T=\frac{2+2+2+2+2+2}{6}=2 and \hat{\theta}_W=\frac{4}{1+\frac{1}{2}+\frac{1}{3}}=2.2
Here, because each mutation is a rare variant, the estimate of \theta is lower in \hat{\theta}_T than in \hat{\theta}_W. This effect is not very pronounced in this example because n is so small, but if we increase n to 100 and make every mutation a rare one (i.e. at a frequency of 0.01), we can see the effect more clearly.

Example 3

-*--------------------------------------------------------------------------------------------------
--*-------------------------------------------------------------------------------------------------
---*------------------------------------------------------------------------------------------------
----*-----------------------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------------------------*-

\hat{\theta}_T=\frac{2(99.5)}{99.5}=2 and \hat{\theta}_W=\frac{100}{1+\frac{1}{2}+...+\frac{1}{99}}=19.31
Woah that’s a huge difference! Because of this difference, we can compare the two estimators and learn something about what kind of variants we have in the population.

Tajima’s D

Tajima’s D is the comparison between the average number of pairwise differences and the number of segregating sites in a sample. We expect positive selection (or selective sweeps) to give us a negative Tajima’s D in a population that doesn’t have any demographic changes going on (population expansion/contraction, migration, etc). This is because \hat{\theta}_W will be greater than \hat{\theta}_T. Why? Because after a selective sweep, most of the haplotypes in a population will be the same. Therefore, when mutations occur they will be rare. When you have a lot of rare mutations, \hat{\theta}_T underestimates \theta compared to \hat{\theta}_W and you get a negative Tajima’s D.

In the case of balancing selection, alleles are kept at intermediate frequencies. This produces a positive Tajima’s D because there will be more pairwise differences than segregating sites.

The effect of bottlenecks

Under demographic scenarios, Tajima’s D can exhibit signals that look like selection even under neutral simulations. Above, you can see how Tajima’s D changes before, during, and after a bottleneck. The effect of population expansion (recovery) on Tajima’s D is staggering. As soon as the population expands, Tajima’s D drops to about -0.45. It continues to drop and begins to recover, but stays negative. This effect is common even when the bottleneck size changes. Below, you can see what happens to Tajima’s D after a bottleneck when the recovery from the bottleneck event varies (10% of the population survives in the bottleneck):

Tajima's D is negative after a bottleneck

Why does this happen? On average, a bottleneck event will remove rare alleles due to sampling and intermediate alleles will be rare. On average, there will be more rare variants than intermediate variants and Tajima’s D will be negative. This makes it hard to distinguish signals of selection from Tajima’s D when a population has gone through a bottleneck. A neutral population with no selection will look the same as a population that has recently undergone a selective sweep. Below is a stronger bottleneck (1% of population survives):

1% bottleneck

and here is a weaker bottleneck (20% of population survives):

20% bottleneck

In all cases, Tajima’s D drops to negative values after a bottleneck. The 1% bottleneck has a much different shape than either of the other bottlenecks. This is because when such an extreme bottleneck occurs, every allele will be rare and Tajima’s D will start out negative. It recovers from the bottleneck in a similar way towards 0. However, in all of these simulations, Tajima’s D never converges to 0 (note that these are neutral simulations). This is because when we have a finite number of samples, the resting point of Tajima’s D is not 0, but a small negative number close to 0.

As we can see from these simulations, Tajima’s D is a complicated statistic and care must be taken when using it to analyze populations. For more on the effect of bottlenecks on populations, check out this presentation: http://www.slideshare.net/jrossibarra/bottlenecks-some-ramblings-and-a-bit-of-data-from-maize

Acknowledgements: Thanks to Jeff Ross-Ibarra for discussions and help with the simulations.

R code for simulations:

options(scipen=999)
library(ggplot2)
sims=1000000
bneck_end=runif(sims,0,2) # end comes first backward in time
bneck_start=bneck_end+0.005 # start comes 2nd so is bigger number
tbsfile<-cbind(bneck_end,bneck_start)
write.table(file="tbsfile",tbsfile,col.names=F,row.names=F,quote=F)
#we do with recombination, with theta=10 or N~80,000
#bneck then is 0.005480,000 = 1,600 generations
f=pipe(paste("~/Downloads/msdir/ms 10 ", sims, " -t 10 -r 10 1000 -eN tbs 0.1 -eN tbs 1 < tbsfile | ~/Downloads/msdir/sample_stats | cut -f 6",sep=""))
taj_d=scan(f)
qplot(bneck_end,taj_d,geom="smooth", xlab="Bottleneck end (4N generations)", ylab="Tajima's D", main="Effect of recovery time on Tajima's D")

PCA with ANGSD and ngsCovar

As a follow up to the last post (PCA with samtools and plink) I’m going to go over how to do PCA using genotype likelihoods from ANGSD and with ngsCovar. This method isn’t integrated in to angsd-wrapper yet, but we are looking to add it.

The documentation is a little light for ngsCovar, but there is a tutorial here (shout out to Peter Fields for linking me to that).

PCA with ANGSD/ngsCovar is pretty straightforward once you know which files you need to generate. ngsCovar takes in a binary genotype likelihood file (.geno). This file is distinct from the .glf files that are also produced by ANGSD. To generate the .geno file, you need to supply these options:

angsd/angsd -bam data/all_samples.txt -GL 1 -out results/all.test -doMaf 2 -doMajorMinor 1 -doGeno 32 -doPost 1 -nind 21 -P 8 -r 1:

Specifically, you need -doGeno 32 and -doPost 1. These should generate a .geno.gz file. Once you have this, you need to unzip your .geno.gz file (ngsCovar won’t accept your zipped .geno file):

gunzip all.test.geno.gz

Now we can run ngsCovar and supply the genotype file. The other options will depend on your data, but the command should look something like this:

ngsPopGen/ngsCovar -probfile results/all.test.geno -outfile pop.covar -nind 21 -nsites 100000 -call 0

Then, following the graphing part of the tutorial we can get a nice graph of our population structure. In order to get what we need from the graph, we have to change the script a little bit:

Rscript -e 'write.table(cbind(seq(1,10),rep(1,10),c(rep("Indica",12),rep("Allopatric",4),rep("Sympatric",4),rep("BC1", 1))), row.names=F, sep=" ", col.names=c("FID","IID","CLUSTER"), file="test.pops.clst", quote=F)'

Specifically, we have to change the part that labels the samples to match what our samples actually are: c(rep("Indica",12),rep("Allopatric",4),rep("Sympatric",4),rep("BC1", 1).

Then we can run the plotPCA.R script provided just like in the example:

Rscript scripts/plotPCA.R -i pop.covar -c 1-2 -a test.pops.clst -o test.pca.pdf

And then you get a pretty graph that looks something like this:

test.pca

PS: Following up from last time, we have included an individual that was supposed to be a backcross (BC1). This individual clusters very closely to the allopatric and sympatric populations and the same sample is an outlier. This suggests 1) ANGSD/ngsCovar and samtools/plink give pretty similar results (at least for this data) and 2) we have switched labels! The outlier should be the BC1 individual and the BC1 individual should be part of the sympatric population.

Population clustering with samtools SNP calling and Plink MDS

I was looking for a guide on how to do population clustering using samtools and plink but couldn’t find a concise guide on exactly what I need to do. So, after some hacking around, I figured it out and decided to write it down. In order to do MDS on bam files, you need to go through several steps. The general idea is:

bwa ref.fasta sample[0-9].fastq > sample[0-9].bam
samtools mpileup -uf ref.fasta sample[0-9].bam | bcftools view -bvcg - > samples.raw.bcf
vcftools samples.vcf -plink samples
plink --file samples --genome
plink --file samples --read-genome samples.genome --cluster --mds-plot 2

Then you can graph your result using R or whatever plotting software you fancy.

Mpileup: Mpileup will create a bcf file which is piped to bcftools, where the -c option will call SNPs. This file is written to disk and then converted to vcf files because vcftools doesn’t like the bcf output of samtools.

[Note: some bam samples removed to reduce complexity]

samtools mpileup -r 1: -uf students/Oryza_indica.ASM465v1.24.dna.genome.fa.gz students/og276.sorted.bam students/og278.sorted.bam | bcftools view -bvcg - > results/all.raw.bcf
bcftools view all.raw.bcf > all.raw.vcf

Convert vcf to ped: This step will take the SNP called vcf file and convert it to a format usable by plink (ped and map).

vcftools --vcf all.raw.vcf --plink --out all.raw

Make a .genome file: In order to do MDS, plink needs a .genome file of your samples. This can be created with something like this:

plink --file all.raw --genome --noweb --allow-no-sex --out all.raw

Do some multidimensional scaling:

plink --file all.raw --read-genome all.raw.genome --cluster --mds-plot 2 --noweb

Plot:
d <- read.table("results/plink.mds", h=T)
d$pop = factor(c(rep("indica", 12), rep("allopatric", 4), rep("sympatric", 4)))
plot(d$C1, d$C2, col=as.integer(d$pop), pch=19, xlab="PC 1", ylab="PC 2", main = "MDS of Oryza (samtools/plink)")
legend("topright", c("Indica", "Allopatric", "Sympatric"), pch=19, col=c(2,1,3))
text(d$C1, d$C2, labels=c(rep(NA, 18), "og276", NA), pos=1)

pca

If you’re wondering what’s going on with og276, I have no idea yet! It should cluster with the other sympatric samples, but it didn’t.