Haplotype networks in R

Haplotype networks are pretty useful for genomic analyses, especially when you’re looking at phylogeography (Shannon et al 2015) or if you’re looking at a region of special interest (Huerta-Sánchez et al 2014). A quick google search will turn up a number of blog posts that tell you how to use pegas to create a haplotype network, but many of them use the provided wood mouse dataset, which is already nicely formatted. If you’re using whole genome data (e.g. Illumina), it turns out to be a little more complicated than just importing a fasta file. In this post, I’ll show you how to get from a whole genome VCF to a haplotype network.

Pegas requires a fasta file (there are other input methods, but I found fasta to be the most straightforward) with your locus of interest (e.g. mitochondria) and requires all individuals to have the same sequence length. The data I’m starting with is a whole genome, all sites VCF file. In order to extract the mitochondria, I wrote a python script available here. To use it, you should supply a vcf file (gzipped or not is fine), an output filename, and a chromosome of interest:

python vcf2fasta.py -v individuals.vcf.gz -g -o mitochondria.fasta -c mitochondria

This script will go through the VCF file and append the appropriate reference or alternate bases for each individual and output a valid fasta file.

In theory, we should be able to plug it into pegas now because it’s a fasta file with all individuals and each individual should have the same sequence length in the mitochondria, especially since it’s already been aligned to the reference. In practice, however, this is not the case. Some individuals will be missing bases, and sometimes the reference has weird issues where sites show up twice, etc. To fix this, I use clustal to align the fasta file again. Either clustal omega or clustalw should work, but clustal omega is supposed to be faster. After running clustal, you should output a fasta file where every individual has the same number of bases. Once this is done, we can finally use pegas to build a network.

input <- "MChloroplast.fasta"
d <- ape::read.dna(input, format='fasta')
e <- dist.dna(d)
h <- pegas::haplotype(d)
h <- sort(h, what = "label")
(net <- pegas::haploNet(h))
ind.hap<-with(
stack(setNames(attr(h, "index"), rownames(h))),
table(hap=ind, pop=rownames(d)[values])
)
plot(net, size=attr(net, "freq"), scale.ratio=0.2, pie=ind.hap)
legend(-8, 0, colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=19, ncol=2)

If everything goes right, you should get something like this:

unnamed-chunk-3-1

Advertisements

8 thoughts on “Haplotype networks in R

  1. Min Huang

    hello,i used your pytohon script,but it told me “No module named argparse”,why?and i need your your help thank u very much!

  2. Min Huang

    hi,i have install the package argparse,but when i run the csript,there is a mistake:_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?) line 29,and then i changed ‘r’ to ‘rt’,it run,but there are nothing ,just like this:”>XXH_XXH8-1″,but no sequence,why?thanks

  3. Hi,

    I’m running into problems with the python script. I phased my vcf file through ShapeIt and I’ve been using the following command:

    python vcf2FASTA.py -v Cadam.phased.vcf -o CadamAllPhasedSamples.fasta -c Cadamanteus_PLA2region_BAC_29M24

    Every time I try to run the script, I get this error message:

    Traceback (most recent call last):
    File “vcf2FASTA.py”, line 52, in
    base = alt_list[int(call)-1]
    ValueError: invalid literal for int() with base 10: ‘0|0’

    I’m not entirely familiar with coding, but I’m guessing that it’s rejecting my format. Otherwise, I’m not sure what I’m doing wrong.

    Any suggestions?

    1. Hi,

      Could you post a few lines of your VCF? I suspect this has to do with the fact that the script expects haploid calls (e.g. 0) instead of diploid calls (e.g. “0|0”). It should be straightforward to add in another line that splits the genotype call at “|”.

  4. marino

    Hi,
    I am having the same problem as abigelowsite. This is how my vcf file looks like for some individuals:
    GT:DP:DPR:RO:QR:AO:QA:GL 0/0:9:9,0:9:347:0:0:0,-2.70927,-31.6156 1/1:4:4,4:0:0:4:141:-13.0425,-1.20412,0 0/1:6:6,4:2:70:4:151:-12.1613,0,-4.84382

    It was not clear to me how to add in another line that splits the genotype call at “|”. It would be great if you could explain that in a bit more detail.
    Thank you so much!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s