Haplotype networks in R

Haplotype networks are pretty useful for genomic analyses, especially when you’re looking at phylogeography (Shannon et al 2015) or if you’re looking at a region of special interest (Huerta-Sánchez et al 2014). A quick google search will turn up a number of blog posts that tell you how to use pegas to create a haplotype network, but many of them use the provided wood mouse dataset, which is already nicely formatted. If you’re using whole genome data (e.g. Illumina), it turns out to be a little more complicated than just importing a fasta file. In this post, I’ll show you how to get from a whole genome VCF to a haplotype network.

Pegas requires a fasta file (there are other input methods, but I found fasta to be the most straightforward) with your locus of interest (e.g. mitochondria) and requires all individuals to have the same sequence length. The data I’m starting with is a whole genome, all sites VCF file. In order to extract the mitochondria, I wrote a python script available here. To use it, you should supply a vcf file (gzipped or not is fine), an output filename, and a chromosome of interest:

python vcf2fasta.py -v individuals.vcf.gz -g -o mitochondria.fasta -c mitochondria

This script will go through the VCF file and append the appropriate reference or alternate bases for each individual and output a valid fasta file.

In theory, we should be able to plug it into pegas now because it’s a fasta file with all individuals and each individual should have the same sequence length in the mitochondria, especially since it’s already been aligned to the reference. In practice, however, this is not the case. Some individuals will be missing bases, and sometimes the reference has weird issues where sites show up twice, etc. To fix this, I use clustal to align the fasta file again. Either clustal omega or clustalw should work, but clustal omega is supposed to be faster. After running clustal, you should output a fasta file where every individual has the same number of bases. Once this is done, we can finally use pegas to build a network.

input <- "MChloroplast.fasta"
d <- ape::read.dna(input, format='fasta')
e <- dist.dna(d)
h <- pegas::haplotype(d)
h <- sort(h, what = "label")
(net <- pegas::haploNet(h))
stack(setNames(attr(h, "index"), rownames(h))),
table(hap=ind, pop=rownames(d)[values])
plot(net, size=attr(net, "freq"), scale.ratio=0.2, pie=ind.hap)
legend(-8, 0, colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=19, ncol=2)

If everything goes right, you should get something like this:


The effect of excess reads and uneven coverage on viral sequence assembly

tl;dr: it makes it harder. Use IDBA_UD for assembly.

Code: https://github.com/arundurvasula/coverage-depth-assembly

One of the coolest results of RNA biology has been the ability to reconstruct full viral genomes from the sequencing of siRNAs found in eukaryotes. The Wu et. al. 2010 paper in PNAS (http://www.pnas.org/content/107/4/1606.full.pdf) demonstrated that this was possible in fruit flies, mosquitoes, and nematodes. One of the things I’ve been working on has been applying this to virus discovery in crop plants. This is a cool method, but one of the problems with this is that most assemblers weren’t build with this kind of data in mind. That is to say, most assemblers expect a sample with relatively even coverage across a genome.

When we try and sequence the siRNAs, we get massive amounts of coverage in some areas and really low levels of coverage in other areas. This makes it difficult for most assemblers to reconstruct a full genome. In order to get around this, one of the things I tried was subsampling and normalizing my reads. The idea here was to even out the coverage to a lower level so that our data is more in line with what assemblers expect.

However, this didn’t go at all like I expected. What I did in the code above was subsample or normalize my reads down to a specific coverage level and map the reads to a virus that I know is in the sample. Below you can see what happened:

Subsampling down to 50x
Coverage: 50x. Method: subsampling

Normalization down to 50x
Coverage: 50x. Method: normalization

Subsampling down to 5x
Coverage: 5x. Method: subsampling

Normalization down to 5x
Coverage: 5x. Method: normalization

As you can see from the graphs, normalization produces a much smoother level of coverage, but doesn’t really even out the coverage at all. This doesn’t help assembly at all. Subsampling performs even worse and makes coverage look pretty terrible (especially at the 5x level). The code to do this is located here in the repository under normalization.sh and subsampling.sh. Normalization was done with bbmap and subsampling was done with bioawk.

As it turns out, the best way to deal with this is to use an assembler that can handle uneven coverage and that’s where IDBA_UD comes in. This assembler is awesome and was designed with uneven coverage in mind. From the homepage:

IDBA-UD is a iterative De Bruijn Graph De Novo Assembler for Short Reads Sequencing data with Highly Uneven Sequencing Depth. It is an extension of IDBA algorithm. IDBA-UD also iterates from small k to a large k. In each iteration, short and low-depth contigs are removed iteratively with cutoff threshold from low to high to reduce the errors in low-depth and high-depth regions. Paired-end reads are aligned to contigs and assembled locally to generate some missing k-mers in low-depth regions. With these technologies, IDBA-UD can iterate k value of de Bruijn graph to a very large value with less gaps and less branches to form long contigs in both low-depth and high-depth regions.

This assembler has performed much better than most of the alternatives (Velvet, ABySS, CLCBio, SOAP, etc.) and I’d highly recommend it if you’ve got uneven coverage in your samples.

PCA with ANGSD and ngsCovar

As a follow up to the last post (PCA with samtools and plink) I’m going to go over how to do PCA using genotype likelihoods from ANGSD and with ngsCovar. This method isn’t integrated in to angsd-wrapper yet, but we are looking to add it.

The documentation is a little light for ngsCovar, but there is a tutorial here (shout out to Peter Fields for linking me to that).

PCA with ANGSD/ngsCovar is pretty straightforward once you know which files you need to generate. ngsCovar takes in a binary genotype likelihood file (.geno). This file is distinct from the .glf files that are also produced by ANGSD. To generate the .geno file, you need to supply these options:

angsd/angsd -bam data/all_samples.txt -GL 1 -out results/all.test -doMaf 2 -doMajorMinor 1 -doGeno 32 -doPost 1 -nind 21 -P 8 -r 1:

Specifically, you need -doGeno 32 and -doPost 1. These should generate a .geno.gz file. Once you have this, you need to unzip your .geno.gz file (ngsCovar won’t accept your zipped .geno file):

gunzip all.test.geno.gz

Now we can run ngsCovar and supply the genotype file. The other options will depend on your data, but the command should look something like this:

ngsPopGen/ngsCovar -probfile results/all.test.geno -outfile pop.covar -nind 21 -nsites 100000 -call 0

Then, following the graphing part of the tutorial we can get a nice graph of our population structure. In order to get what we need from the graph, we have to change the script a little bit:

Rscript -e 'write.table(cbind(seq(1,10),rep(1,10),c(rep("Indica",12),rep("Allopatric",4),rep("Sympatric",4),rep("BC1", 1))), row.names=F, sep=" ", col.names=c("FID","IID","CLUSTER"), file="test.pops.clst", quote=F)'

Specifically, we have to change the part that labels the samples to match what our samples actually are: c(rep("Indica",12),rep("Allopatric",4),rep("Sympatric",4),rep("BC1", 1).

Then we can run the plotPCA.R script provided just like in the example:

Rscript scripts/plotPCA.R -i pop.covar -c 1-2 -a test.pops.clst -o test.pca.pdf

And then you get a pretty graph that looks something like this:


PS: Following up from last time, we have included an individual that was supposed to be a backcross (BC1). This individual clusters very closely to the allopatric and sympatric populations and the same sample is an outlier. This suggests 1) ANGSD/ngsCovar and samtools/plink give pretty similar results (at least for this data) and 2) we have switched labels! The outlier should be the BC1 individual and the BC1 individual should be part of the sympatric population.

Population clustering with samtools SNP calling and Plink MDS

I was looking for a guide on how to do population clustering using samtools and plink but couldn’t find a concise guide on exactly what I need to do. So, after some hacking around, I figured it out and decided to write it down. In order to do MDS on bam files, you need to go through several steps. The general idea is:

bwa ref.fasta sample[0-9].fastq > sample[0-9].bam
samtools mpileup -uf ref.fasta sample[0-9].bam | bcftools view -bvcg - > samples.raw.bcf
vcftools samples.vcf -plink samples
plink --file samples --genome
plink --file samples --read-genome samples.genome --cluster --mds-plot 2

Then you can graph your result using R or whatever plotting software you fancy.

Mpileup: Mpileup will create a bcf file which is piped to bcftools, where the -c option will call SNPs. This file is written to disk and then converted to vcf files because vcftools doesn’t like the bcf output of samtools.

[Note: some bam samples removed to reduce complexity]

samtools mpileup -r 1: -uf students/Oryza_indica.ASM465v1.24.dna.genome.fa.gz students/og276.sorted.bam students/og278.sorted.bam | bcftools view -bvcg - > results/all.raw.bcf
bcftools view all.raw.bcf > all.raw.vcf

Convert vcf to ped: This step will take the SNP called vcf file and convert it to a format usable by plink (ped and map).

vcftools --vcf all.raw.vcf --plink --out all.raw

Make a .genome file: In order to do MDS, plink needs a .genome file of your samples. This can be created with something like this:

plink --file all.raw --genome --noweb --allow-no-sex --out all.raw

Do some multidimensional scaling:

plink --file all.raw --read-genome all.raw.genome --cluster --mds-plot 2 --noweb

d <- read.table("results/plink.mds", h=T)
d$pop = factor(c(rep("indica", 12), rep("allopatric", 4), rep("sympatric", 4)))
plot(d$C1, d$C2, col=as.integer(d$pop), pch=19, xlab="PC 1", ylab="PC 2", main = "MDS of Oryza (samtools/plink)")
legend("topright", c("Indica", "Allopatric", "Sympatric"), pch=19, col=c(2,1,3))
text(d$C1, d$C2, labels=c(rep(NA, 18), "og276", NA), pos=1)


If you’re wondering what’s going on with og276, I have no idea yet! It should cluster with the other sympatric samples, but it didn’t.

Searching for selection in the human genome with angsd-wrapper

The Nielsen lab has been working on a sweet program called ANGSD (Analysis of Next Generation Sequencing Data). It allows the calculation of many population genetic statistics from .bam files (along with a reference and ancestral genome). In order to make this program easier to use and to streamline analysis, I have been working on a set of wrapper scripts called angsd-wrapper. One feature of this wrapper that I’m excited about is the ability to graph the output using the awesome R package Shiny. Here, I’m going to run through using the software using widely available data from the 1000genomes project.

Download angsd-wrapper

$ git clone https://github.com/arundurvasula/angsd-wrapper.git 1000genomes

This code clones the angsd-wrapper repository into a folder called 1000genomes. There’s a readme in the repository, but a lot of the documentation is actually on the Github wiki page.

Download data:

Next, we can download some data into the data folder:


$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00097/alignment/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00099/alignment/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam

Human reference:

$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/phase2_reference_assembly_sequence/hs37d5.fa.gz

Chimp reference:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/panTro4/bigZips/panTro4.fa.gz

We’re almost done grabbing the data. Now we just need to create a file that tells ANGSD where the bams are located:



Cool! And then we create a file that hold the inbreeding coefficients in a file (assumed to be 0 for this case, but really should be estimated):



Now we’re really done.

Run the software

Okay, now we get to run the software! This is where angsd-wrapper attempts to make things easier. We eventually want to calculate theta and some neutrality test statistics. First, we need to create an SFS and then we can run the thetas part of ANGSD.

In order to do this, we modify the sfs_example.conf configuration file to point to our data and do a few other things:




Regions is set to “1:” so that we only look at chromosome 1 for now. Also, the ancestral sequence points to the human reference because ANGSD did not like how the chimp and human genomes aligned. We can run this with the following line of code:

$ bash scripts/ANGSD_SFS.sh scripts/sfs_example.conf

Once that’s finished, we should have a nice shiny file called EUR_Derived in the results folder. This file contains the site frequency spectrum. Now that that’s over, we can run the theta calculation workflow. Similar to above, the thetas_example.conf file is modified:




This file is pretty similar to the one above, but there are some options for pointing to the SFS and for the sliding window. This can be run in the same way as the SFS calculation:

$ bash scripts/ANGSD_Thetas.sh scripts/thetas_example.conf

Cool! Now we get to look at the data using the interactive Shiny graph!

Download data locally and download GFF and view in shiny

If you are using a cluster to run ANGSD, you will need to download the data to your local computer because Shiny is easier to deal with using a GUI. The data can be scp‘d over. Once you have it, you can navigate to the scripts folder in the terminal and run the Shiny application:

$ cd scripts
$ R
> library(shiny)
> runApp("shiny")

Then, load the data in the web app and wait for the graph (it’s a lot of data!). Hopefully, you get something like this:


At this zoom it doesn’t really tell us anything, but because the graph is interactive, we can restrict the viewing window to any region we want.

Later on we are going to localize our search to interesting regions of the genome.

Forward simulation pipelines

A lot my work recently has focused on using forward simulations to test demographic models. Often times, this requires thousands (or more) of simulations. Here, I’d like to talk about my approach to running these simulations and keeping track of everything.

We use a cluster with the Slurm job queue system and while each cluster has different quirks, the basic ideas are mostly the same. First, we use SLiM to simulate our populations. The basic pipeline is:

slim | msstats sim.stats.$JOBID.txt

Slim uses a configuration file to create its models. This complicates things a little bit. In order to maintain flexibility of our model (i.e. drawing mutation rate from a distribution or from data and setting complex population scenarios) I have created an R script to write a configuration file for slim to read from (can be read here).

This R script has a few features. First, it keeps track of the job ID from the queueing system so that we can keep track of the file names. Next, each numerical value is set independently, allowing us to draw any of them from a distribution.

It also has a function to shape how the population grows over time. For example, in addition to choosing the initial and final population sizes, we can choose how it gets to that final population size, whether it is linear or exponential growth.

After the configuration file is written and saved to disk, slim is called from within R and the simulation starts. The simulation is piped to an awk script that prepares the output for msstats, and the output of msstats is saved with the same job ID.

This approach allows for flexible simulations and makes it easy to perform thousands of them. The pipeline is available on Github (extensive documentation pending): https://github.com/arundurvasula/domestication_sims

Data management software proposal

I’m thinking of writing software to solve a problem. Before that, I wanted to write a proposal and find out what other people think about it. If you have any comments, please leave them at: https://gist.github.com/arundurvasula/bc9ea4506270f555d1b3 or as a comment here.

tl;dr: data management software that logs access to files.

1. The problem:

Modern data analysis relies on many sophisticated tools that perform a wide range of calculations on data. While software continues to evolve along with methods, data management still remains a complicated problem.

First, data analysis involves a lot of trail and error. One method may work well on one dataset, but it may not work as well on another. In its nature, data analysis must be done many times to arrive upon the best solution (if there is one). This process of trial and error, however is costly in time and organization. While solutions exist to mitigate these problems (for example, software that runs other software for you), these solutions are not complete.

Specifically, organization is difficult because there is no obvious and systematic way to keep track of what has been done to data. For example, when assembling sequence data many assemblers must be used with different options to find the optimal assembly method and options. While it’s possible to keep track of what one has done in a script, this method will not capture any data analysis done outside of the script.

Second, over time, it becomes difficult to remember what analyses have been done on data. This can be addressed by appending descriptors to the filename (e.g. sample1.trimmed.qual.mapped.bam). However, this quickly becomes unwieldy and fails to capture exactly what has happened to the file (including program options).

2.The solution:

The project suggested here is a daemon that watches data directories using the inotify API. It stores information about what processes and users read and modify data and stores it in a hidden json log file in that directory. It will also support arbitrary metadata used to describe the data in the same json log file. For example, it can store information about how and when the data was collected. This information need not be present in all data, which provides flexibility in describing the data.

Second, this project will provide a local webserver to access and modify the logs in a user friendly manner. Because the log format is standard json, we can build upon previous web applications to quickly build a web front end, similar to CouchDB’s Futon (http://docs.couchdb.org/en/1.6.1/intro/futon.html).

Using synthetic lethality for targeted cell death

A lethal mutation is one that spells certain death for an organism. These mutations can be induced by a researcher, or occur spontaneously between generations. Sometimes, a mutation isn’t lethal unless it is combined with another nonlethal mutation – a condition called synthetic lethality. This kind of interaction is important because it can cause a daughter cell to die unexpectedly. One example of this kind of interaction is a mutation in the BRCA1 or BRCA2 gene and a mutation in the Poly-ADP Ribose Polymerase (PARP) gene, leading to a loss of function. Independently, mutations in either one of these genes would not cause cell death. However, when both genes are mutated, a synthetic lethal interaction occurs, and the cells can die. Additionally, mutations do not have to occur for a synthetic lethal interaction to occur. If the action of PARP is inhibited (for example, by a small molecule), it is possible that the synthetic lethal result can still occur.

This type of interaction can be taken advantage of when cell death is desired, namely, cancer cell death. Scientists have observed that most cancer cells have mutations in cohesin related genes, which are genes that create proteins that oversee how chromosomes separate during cell replication.

The goal of a study by Jessica McLellen et. al. was to see if PARP inhibitors could lead to cell death in cancerous cells with mutations in their cohesin related genes. They approached this by looking at model organisms representative of human systems, namely yeast (Saccharomyces cerevisiae) and a nematode worm (Caenorhabditis elegans). These organisms have very similar replication systems, and as the authors hypothesized, would be representative of how human systems work. The authors looked for, and found, processes that were required for survival after cohesin mutations in yeast. They then checked for these same systems in C. elegans. They hypothesized that if these systems are conserved between the two species, they are probably conserved in humans as well. Using this information, the authors found that when cohesin is mutated, several proteins related to replication fork progress and stability were required. This lead them to hypothesize that proteins not found in yeast but found in C. elegans would be required for other higher eukaryotes. These proteins included the aforementioned PARP. After testing human cells, the authors found that cells with cohesin mutations were less likely to survive when treated with a PARP inhibitor. This is significant because cancerous cells are the ones that likely have cohesin mutations, and those ones can be made less viable by treatment with PARP inhibitors.

Cancer treatments that target PARP inhibition are already on the way and are in phase II of clinical trials. This research supports their effectiveness with evidence from model organisms that have very similar systems when compared with humans. While the exact mechanism that contributes to the lethality is not yet fully understood, this work shows that PARP inhibition is a viable target for tumors with cohesin mutations, which represents a significant percentage of colorectal, ovarian, and breast cancers.



McLellan JL, O’Neil NJ, Barrett I, Ferree E, van Pel DM, et al. (2012) Synthetic Lethality of Cohesins with PARPs and Replication Fork Mediators. PLoS Genet 8(3): e1002574. doi:10.1371/journal.pgen.1002574

Remote IPython notebook with Raspberry Pi

It recently stuck me that I’d really like to do some data analysis on my iPhone. This is pretty impractical because it’s a tiny screen and the keyboard often autocorrects and capitalizes things I don’t want it to, but I still wanted to do it because 1) I can, and 2) why not?.

To do this, I connected my Raspberry Pi up to the router, created a static IP address, set my router to port forward to that IP address, installed IPython and its friends, and set up a remote notebook. Pretty simple stuff!

Static IP

To set up the static IP, I basically followed these directions.

Port forwarding

I have a Netgear router so I just logged in at routerlogin.net, went to port forwarding, and forwarded to my Raspberry Pi’s static IP address at port 9999.

Installing IPython and friends

Since I’m using Raspbian, it’s as easy as:

$sudo apt-get -y install ipython-notebook
$sudo apt-get -y install python-matplotlib python-scipy \
python-pandas python-sympy python-nose

(copied from the IPython install guide)

Setting up a remote notebook

Setting up a remote notebook involved following these instructions.

  1. Set up a profile for the server:

$ipython profile create nbserver

  1. Create a password hash in python:

from IPython.lib import passwd

  1. Configure notebook server (change stuff in /home/user/.ipython/profile_nbserver/ipython_notebook_config.py)

c = get_config()

Kernel config

c.IPKernelApp.pylab = ‘inline’

Notebook config

c.NotebookApp.ip = ‘*’
c.NotebookApp.open_browser = False
c.NotebookApp.password = u’hash’
c.NotebookApp.port = 9999

  1. Then to start the server:

ipython notebook –profile=nbserver

The directions are explained much better at the link. But now, I just point my iPhone’s browser at my IP address, and BOOM! I’m ready to be distracted in class.

I also added the line to start up notebooks into rc.local file located /etc/rc.local so that the notebook server starts up each time I log in.
Now scientific computing on a 700 MHz CPU is not ideal, so next I’m going to figure out how to create a cluster of Raspberry Pis, but that will likely wait until summer.

Gene Ontology and Object Oriented Programming

The Gene Ontology (GO, http://www.geneontology.org) project is one that seeks to “standardize the representation of gene and gene product attributes across species and databases.” The way they do this is by creating a directed acyclic graph representing the functions and relationships of specific genes. Each node passed through increases the specificity of the child node. This is vaguely similar to how inheritance in object orientated programming.

In object oriented programming, inheritance promotes code reuse without duplication. The way this works (briefly) is, a base class is created containing methods and variables that classes could inherit from it. For example, if you were modeling vehicles, you could have a Vehicle class, which contains methods like move(), stop(), etc. and variables like moveSpeed, passengers, etc. This comes in handy when you make different types of vehicles like Bikes, Cars, and Trains. When you do that, you can inherit the methods like move() and stop() as well as the variables like moveSpeed and passengers. Thus, you only have to write those methods and variables once in the base Vehicle class. This image may be of help in visualizing it:

This is nice and tidy way of representing the world, only the world doesn’t exactly work like that. A great example is here, which explains how some of the principles of object oriented programming break down in video game programming.

The point is, not all objects can be placed in hierarchies with inheritance; sometimes you have groups that just don’t fit in anywhere and this turns the beautiful directed acyclic graph of inheritance into an ugly mess.

Back to gene ontologies, when storing our information about genes in a hierarchical manner, it’s possible that not all genes inherit the same properties as the nodes above it. However, I believe that GO has gotten around that by defining different relationships between nodes. For example, in this GO term, there are several relationships between nodes: is a, has part, is part, etc.

It’s possible that a similar method of inheritance could fix some broken models produced by objected oriented inheritance, but I don’t know of any languages that implement this fully (I also have not looked very much).