Searching for selection in the human genome with angsd-wrapper

The Nielsen lab has been working on a sweet program called ANGSD (Analysis of Next Generation Sequencing Data). It allows the calculation of many population genetic statistics from .bam files (along with a reference and an ancestral genome). In order to make this program easier to use and to streamline analysis, I have been working on a set of wrapper scripts called angsd-wrapper. One feature of this wrapper that I’m excited about is the ability to graph the output using the awesome R package Shiny. Here, I’m going to run through the software using widely available data from the 1000 Genomes Project.

Download angsd-wrapper

$ git clone https://github.com/arundurvasula/angsd-wrapper.git 1000genomes

This command clones the angsd-wrapper repository into a folder called 1000genomes. There’s a readme in the repository, but most of the documentation actually lives on the GitHub wiki.

Download data:

Next, we can download some data into the data folder.
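I’ll assume the wget commands below are all run from inside that data folder (create it first if the clone doesn’t already include one):

$ cd 1000genomes
$ mkdir -p data && cd data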

Bams:

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00097/alignment/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00099/alignment/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
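ANGSD reads specific regions from sorted, indexed BAMs, so make sure each of these has a .bai sitting next to it. The 1000 Genomes FTP hosts index files alongside the bams, or you can build them yourself, assuming samtools is on your path:

$ for b in *.bam; do samtools index $b; done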

Human reference:

$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

Chimp reference:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/panTro4/bigZips/panTro4.fa.gz
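One caveat: these references come down gzip-compressed, and depending on your ANGSD and samtools versions you may need a plain, faidx-indexed FASTA. If ANGSD complains about the reference, decompressing and indexing it is the usual fix (and then point ANC_SEQ/REF_SEQ in the configs below at the .fa instead of the .fa.gz):

$ gunzip hs37d5.fa.gz
$ samtools faidx hs37d5.fa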

We’re almost done grabbing the data. Now we just need to create a file that tells ANGSD where the bams are located:

EUR_samples.txt

/home/adurvasu/1000genomes/data/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
/home/adurvasu/1000genomes/data/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
/home/adurvasu/1000genomes/data/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
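Rather than typing absolute paths by hand, you can generate this list from the repository root (adjust the output location to wherever your config expects the sample list to live):

$ ls $PWD/data/*.bam > EUR_samples.txt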

Cool! Then we create a file that holds the per-individual inbreeding coefficients (assumed to be 0 for this case, but these really should be estimated):

EUR.F.txt

0
0
0
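A quick way to generate this file so it stays in sync with the sample list (one zero per line of EUR_samples.txt):

$ awk '{print 0}' EUR_samples.txt > EUR.F.txt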

Now we’re really done.

Run the software

Okay, now we get to run the software! This is where angsd-wrapper attempts to make things easier. We eventually want to calculate theta and some neutrality test statistics. First, we need to estimate a site frequency spectrum (SFS), and then we can run the thetas part of ANGSD.

In order to do this, we modify the sfs_example.conf configuration file to point to our data and do a few other things:

sfs_example.conf

UNIX_USER="adurvasu"
PROJECT_DIR=/home/$UNIX_USER/1000genomes/
ANGSD_DIR=${PROJECT_DIR}/angsd

ANC_SEQ=./data/hs37d5.fa.gz
REF_SEQ=./data/hs37d5.fa.gz
TAXON=EUR
REGIONS="1:"
DO_SAF=1
OVERRIDE=false
N_CPUS=16

REGIONS is set to “1:” so that we only look at chromosome 1 for now. Also, the ancestral sequence points to the human reference because ANGSD did not like how the chimp and human genomes aligned. We can run this with the following line of code:

$ bash scripts/ANGSD_SFS.sh scripts/sfs_example.conf
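Depending on coverage, this step can take a while. When it finishes, a quick sanity check on the output is worth doing; this is a minimal sketch that assumes the wrapper keeps ANGSD’s usual one-line, whitespace-separated SFS format, in which case you should see 2N+1 = 7 values for our three diploid samples (one per derived-allele frequency class):

$ wc -w results/EUR_DerivedSFS
$ cat results/EUR_DerivedSFS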

Once that’s finished, we should have a nice shiny file called EUR_DerivedSFS in the results folder; this is the site frequency spectrum. With the SFS in hand, we can run the theta calculation workflow. Similar to above, the thetas_example.conf file is modified:

thetas_example.conf

UNIX_USER="adurvasu"
PROJECT_DIR=/home/$UNIX_USER/1000genomes/
ANGSD_DIR=${PROJECT_DIR}/angsd

ANC_SEQ=./data/hs37d5.fa.gz
REF_SEQ=./data/hs37d5.fa.gz
TAXON=EUR
REGIONS="1:"
PEST=results/${TAXON}_DerivedSFS
OVERRIDE=true
SLIDING_WINDOW=true
WIN=1000
STEP=500
DO_SAF=1

This file is pretty similar to the one above, but it adds options for pointing to the SFS (the PEST line) and for the sliding window: with WIN=1000 and STEP=500, statistics are computed in 1,000 bp windows that advance 500 bp at a time, so adjacent windows overlap by half. This can be run in the same way as the SFS calculation:

$ bash scripts/ANGSD_Thetas.sh scripts/thetas_example.conf
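As before, the output lands in the results folder. The exact file names depend on the wrapper version, but ANGSD’s sliding-window summary conventionally ends in .pestPG and holds per-window Watterson’s theta, pairwise theta, Tajima’s D, and related neutrality statistics, so something like this should show the first few windows (the glob is an assumption; adjust it to whatever files you actually see):

$ ls results/
$ head results/*.pestPG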

Cool! Now we get to look at the data using the interactive Shiny graph!

Download the data and GFF locally and view in Shiny

If you are running ANGSD on a cluster, you will need to download the data to your local computer, since Shiny is much easier to work with locally where you have a browser. The data can be scp’d over. Once you have it, navigate to the scripts folder in the terminal and run the Shiny application:

$ cd scripts
$ R
> library(shiny)
> runApp("shiny")
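If you’d rather not open an interactive R session, the same thing works as a shell one-liner (still run from the scripts folder):

$ R -e 'shiny::runApp("shiny")'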

Then, load the data in the web app and wait for the graph (it’s a lot of data!). Hopefully, you get something like this:

[Two screenshots of the interactive Shiny plot, showing the computed statistics across all of chromosome 1.]

At this zoom it doesn’t really tell us anything, but because the graph is interactive, we can restrict the viewing window to any region we want.

Later on we are going to localize our search to interesting regions of the genome.

Forward simulation pipelines

A lot of my work recently has focused on using forward simulations to test demographic models. Oftentimes, this requires thousands (or more) of simulations. Here, I’d like to talk about my approach to running these simulations and keeping track of everything.

We use a cluster with the Slurm job queueing system, and while each cluster has its own quirks, the basic ideas are mostly the same. First, we use SLiM to simulate our populations. The basic pipeline is:

slim | msstats > sim.stats.$JOBID.txt

SLiM uses a configuration file to define its models, which complicates things a little bit. In order to keep the model flexible (e.g. drawing the mutation rate from a distribution or from data, and setting up complex population scenarios), I have written an R script that generates a configuration file for SLiM to read (the script lives in the repository linked at the end of this post).

This R script has a few features. First, it keeps track of the job ID from the queueing system so that output files stay tied to the job that produced them. Next, each numerical value is set independently, which lets us draw any of them from a distribution.

It also has a function that shapes how the population grows over time. For example, in addition to choosing the initial and final population sizes, we can choose how it gets from one to the other, e.g. via linear or exponential growth.
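To make that concrete, here’s a throwaway sketch (made-up numbers, not from any real model) of how linear versus exponential interpolation between a starting size and a final size plays out over 100 generations:

$ awk -v n0=1000 -v n1=10000 -v T=100 'BEGIN{for(t=0;t<=T;t+=25) printf "gen %d: linear %.0f, exponential %.0f\n", t, n0+(n1-n0)*t/T, n0*exp(log(n1/n0)*t/T)}'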

After the configuration file is written and saved to disk, SLiM is called from within R and the simulation starts. The simulation output is piped to an awk script that prepares it for msstats, and the output of msstats is saved under the same job ID.
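To tie the pieces together, a stripped-down Slurm job script for this kind of run might look like the sketch below. The R script and awk file names are made up for illustration (the actual layout is in the repository linked next); the point is just that the scheduler’s job ID flows into R, which writes the SLiM config and then launches the simulation chain:

#!/bin/bash
#SBATCH --job-name=slim-sim
# write_and_run.R is a hypothetical stand-in for the config-writing R script described above.
# It writes slim.config.$SLURM_JOB_ID.txt and then runs, via system(),
#   slim slim.config.$SLURM_JOB_ID.txt | awk -f prep_msstats.awk | msstats > sim.stats.$SLURM_JOB_ID.txt
Rscript write_and_run.R "$SLURM_JOB_ID"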

This approach allows for flexible simulations and makes it easy to perform thousands of them. The pipeline is available on GitHub (extensive documentation pending): https://github.com/arundurvasula/domestication_sims

Data management software proposal

I’m thinking of writing software to solve a problem. Before that, I wanted to write a proposal and find out what other people think about it. If you have any comments, please leave them at: https://gist.github.com/arundurvasula/bc9ea4506270f555d1b3 or as a comment here.


tl;dr: data management software that logs access to files.

1. The problem:

Modern data analysis relies on many sophisticated tools that perform a wide range of calculations on data. While software continues to evolve along with methods, data management remains a complicated problem.

First, data analysis involves a lot of trial and error. One method may work well on one dataset but not as well on another. By its nature, data analysis must be repeated many times to arrive at the best solution (if there is one). This process of trial and error, however, is costly in time and organization. While solutions exist to mitigate these problems (for example, software that runs other software for you), they are not complete.

Specifically, organization is difficult because there is no obvious and systematic way to keep track of what has been done to data. For example, when assembling sequence data, many assemblers must be tried with different options to find the optimal assembly method and settings. While it’s possible to keep track of what one has done in a script, that approach will not capture any analysis done outside of the script.

Second, over time, it becomes difficult to remember what analyses have been done on data. This can be addressed by appending descriptors to the filename (e.g. sample1.trimmed.qual.mapped.bam). However, this quickly becomes unwieldy and fails to capture exactly what has happened to the file (including program options).

2. The solution:

The project proposed here is a daemon that watches data directories using the inotify API. It records which processes and users read or modify the data, writing this information to a hidden JSON log file in that directory. It will also support arbitrary metadata describing the data in the same JSON log file; for example, it can record how and when the data were collected. This metadata need not be present for every dataset, which provides flexibility in describing the data.
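To make the log format concrete, a single entry might look something like this; the field names and values are purely illustrative, not a fixed schema:

{
  "file": "sample1.bam",
  "events": [
    {"time": "2015-03-02T14:11:08Z", "user": "arun", "process": "samtools sort", "action": "modify"}
  ],
  "metadata": {"collected": "2014 field season", "notes": "low-coverage whole-genome sequencing"}
}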

Second, this project will provide a local webserver for accessing and modifying the logs in a user-friendly manner. Because the log format is standard JSON, we can build on existing web applications to quickly put together a front end, similar to CouchDB’s Futon (http://docs.couchdb.org/en/1.6.1/intro/futon.html).