Haplotype networks are pretty useful for genomic analyses, especially when you’re looking at phylogeography (Shannon et al 2015) or if you’re looking at a region of special interest (Huerta-Sánchez et al 2014). A quick google search will turn up a number of blog posts that tell you how to use pegas to create a haplotype network, but many of them use the provided wood mouse dataset, which is already nicely formatted. If you’re using whole genome data (e.g. Illumina), it turns out to be a little more complicated than just importing a fasta file. In this post, I’ll show you how to get from a whole genome VCF to a haplotype network.
Pegas requires a fasta file (there are other input methods, but I found fasta to be the most straightforward) with your locus of interest (e.g. mitochondria) and requires all individuals to have the same sequence length. The data I’m starting with is a whole genome, all sites VCF file. In order to extract the mitochondria, I wrote a python script available here. To use it, you should supply a vcf file (gzipped or not is fine), an output filename, and a chromosome of interest:
python vcf2fasta.py -v individuals.vcf.gz -g -o mitochondria.fasta -c mitochondria
This script will go through the VCF file and append the appropriate reference or alternate bases for each individual and output a valid fasta file.
In theory, we should be able to plug it into pegas now because it’s a fasta file with all individuals and each individual should have the same sequence length in the mitochondria, especially since it’s already been aligned to the reference. In practice, however, this is not the case. Some individuals will be missing bases, and sometimes the reference has weird issues where sites show up twice, etc. To fix this, I use clustal to align the fasta file again. Either clustal omega or clustalw should work, but clustal omega is supposed to be faster. After running clustal, you should output a fasta file where every individual has the same number of bases. Once this is done, we can finally use pegas to build a network.
input <- "MChloroplast.fasta"
d <- ape::read.dna(input, format='fasta')
e <- dist.dna(d)
h <- pegas::haplotype(d)
h <- sort(h, what = "label")
(net <- pegas::haploNet(h))
stack(setNames(attr(h, "index"), rownames(h))),
plot(net, size=attr(net, "freq"), scale.ratio=0.2, pie=ind.hap)
legend(-8, 0, colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=19, ncol=2)
If everything goes right, you should get something like this: