As a follow up to the last post (PCA with samtools and plink) I’m going to go over how to do PCA using genotype likelihoods from ANGSD and with ngsCovar. This method isn’t integrated in to angsd-wrapper yet, but we are looking to add it.
The documentation is a little light for ngsCovar, but there is a tutorial here (shout out to Peter Fields for linking me to that).
PCA with ANGSD/ngsCovar is pretty straightforward once you know which files you need to generate. ngsCovar takes in a binary genotype likelihood file (.geno). This file is distinct from the .glf files that are also produced by ANGSD. To generate the .geno file, you need to supply these options:
angsd/angsd -bam data/all_samples.txt -GL 1 -out results/all.test -doMaf 2 -doMajorMinor 1 -doGeno 32 -doPost 1 -nind 21 -P 8 -r 1:
Specifically, you need -doGeno 32 and -doPost 1. These should generate a .geno.gz file. Once you have this, you need to unzip your .geno.gz file (ngsCovar won’t accept your zipped .geno file):
Now we can run ngsCovar and supply the genotype file. The other options will depend on your data, but the command should look something like this:
ngsPopGen/ngsCovar -probfile results/all.test.geno -outfile pop.covar -nind 21 -nsites 100000 -call 0
Then, following the graphing part of the tutorial we can get a nice graph of our population structure. In order to get what we need from the graph, we have to change the script a little bit:
Rscript -e 'write.table(cbind(seq(1,10),rep(1,10),c(rep("Indica",12),rep("Allopatric",4),rep("Sympatric",4),rep("BC1", 1))), row.names=F, sep=" ", col.names=c("FID","IID","CLUSTER"), file="test.pops.clst", quote=F)'
Specifically, we have to change the part that labels the samples to match what our samples actually are:
Then we can run the plotPCA.R script provided just like in the example:
Rscript scripts/plotPCA.R -i pop.covar -c 1-2 -a test.pops.clst -o test.pca.pdf
And then you get a pretty graph that looks something like this:
PS: Following up from last time, we have included an individual that was supposed to be a backcross (BC1). This individual clusters very closely to the allopatric and sympatric populations and the same sample is an outlier. This suggests 1) ANGSD/ngsCovar and samtools/plink give pretty similar results (at least for this data) and 2) we have switched labels! The outlier should be the BC1 individual and the BC1 individual should be part of the sympatric population.