Population clustering with samtools SNP calling and Plink MDS

I was looking for a guide on how to do population clustering using samtools and plink but couldn’t find a concise guide on exactly what I need to do. So, after some hacking around, I figured it out and decided to write it down. In order to do MDS on bam files, you need to go through several steps. The general idea is:

bwa ref.fasta sample[0-9].fastq > sample[0-9].bam
samtools mpileup -uf ref.fasta sample[0-9].bam | bcftools view -bvcg - > samples.raw.bcf
vcftools samples.vcf -plink samples
plink --file samples --genome
plink --file samples --read-genome samples.genome --cluster --mds-plot 2

Then you can graph your result using R or whatever plotting software you fancy.

Mpileup: Mpileup will create a bcf file which is piped to bcftools, where the -c option will call SNPs. This file is written to disk and then converted to vcf files because vcftools doesn’t like the bcf output of samtools.

[Note: some bam samples removed to reduce complexity]

samtools mpileup -r 1: -uf students/Oryza_indica.ASM465v1.24.dna.genome.fa.gz students/og276.sorted.bam students/og278.sorted.bam | bcftools view -bvcg - > results/all.raw.bcf
bcftools view all.raw.bcf > all.raw.vcf

Convert vcf to ped: This step will take the SNP called vcf file and convert it to a format usable by plink (ped and map).

vcftools --vcf all.raw.vcf --plink --out all.raw

Make a .genome file: In order to do MDS, plink needs a .genome file of your samples. This can be created with something like this:

plink --file all.raw --genome --noweb --allow-no-sex --out all.raw

Do some multidimensional scaling:

plink --file all.raw --read-genome all.raw.genome --cluster --mds-plot 2 --noweb

d <- read.table("results/plink.mds", h=T)
d$pop = factor(c(rep("indica", 12), rep("allopatric", 4), rep("sympatric", 4)))
plot(d$C1, d$C2, col=as.integer(d$pop), pch=19, xlab="PC 1", ylab="PC 2", main = "MDS of Oryza (samtools/plink)")
legend("topright", c("Indica", "Allopatric", "Sympatric"), pch=19, col=c(2,1,3))
text(d$C1, d$C2, labels=c(rep(NA, 18), "og276", NA), pos=1)


If you’re wondering what’s going on with og276, I have no idea yet! It should cluster with the other sympatric samples, but it didn’t.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s