Searching for selection in the human genome with angsd-wrapper

The Nielsen lab has been working on a sweet program called ANGSD (Analysis of Next Generation Sequencing Data). It allows the calculation of many population genetic statistics from .bam files (along with a reference and ancestral genome). In order to make this program easier to use and to streamline analysis, I have been working on a set of wrapper scripts called angsd-wrapper. One feature of this wrapper that I’m excited about is the ability to graph the output using the awesome R package Shiny. Here, I’m going to run through using the software using widely available data from the 1000genomes project.

Download angsd-wrapper

$ git clone https://github.com/arundurvasula/angsd-wrapper.git 1000genomes

This code clones the angsd-wrapper repository into a folder called 1000genomes. There’s a readme in the repository, but a lot of the documentation is actually on the Github wiki page.

Download data:

Next, we can download some data into the data folder:

Bams:

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00097/alignment/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00099/alignment/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam

Human reference:

$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/phase2_reference_assembly_sequence/hs37d5.fa.gz

Chimp reference:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/panTro4/bigZips/panTro4.fa.gz

We’re almost done grabbing the data. Now we just need to create a file that tells ANGSD where the bams are located:

EUR_samples.txt

/home/adurvasu/1000genomes/data/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
/home/adurvasu/1000genomes/data/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
/home/adurvasu/1000genomes/data/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam

Cool! And then we create a file that hold the inbreeding coefficients in a file (assumed to be 0 for this case, but really should be estimated):

EUR.F.txt

0
0
0

Now we’re really done.

Run the software

Okay, now we get to run the software! This is where angsd-wrapper attempts to make things easier. We eventually want to calculate theta and some neutrality test statistics. First, we need to create an SFS and then we can run the thetas part of ANGSD.

In order to do this, we modify the sfs_example.conf configuration file to point to our data and do a few other things:

sfs_example.conf

UNIX_USER="adurvasu"
PROJECT_DIR=/home/$UNIX_USER/1000genomes/
ANGSD_DIR=${PROJECT_DIR}/angsd

ANC_SEQ=./data/hs37d5.fa.gz
REF_SEQ=./data/hs37d5.fa.gz
TAXON=EUR
REGIONS="1:"
DO_SAF=1
OVERRIDE=false
N_CPUS=16

Regions is set to “1:” so that we only look at chromosome 1 for now. Also, the ancestral sequence points to the human reference because ANGSD did not like how the chimp and human genomes aligned. We can run this with the following line of code:

$ bash scripts/ANGSD_SFS.sh scripts/sfs_example.conf

Once that’s finished, we should have a nice shiny file called EUR_Derived in the results folder. This file contains the site frequency spectrum. Now that that’s over, we can run the theta calculation workflow. Similar to above, the thetas_example.conf file is modified:

thetas_example.conf

UNIX_USER="adurvasu"
PROJECT_DIR=/home/$UNIX_USER/1000genomes/
ANGSD_DIR=${PROJECT_DIR}/angsd

ANC_SEQ=./data/hs37d5.fa.gz
REF_SEQ=./data/hs37d5.fa.gz
TAXON=EUR
REGIONS="1:"
PEST=results/${TAXON}_DerivedSFS
OVERRIDE=true
SLIDING_WINDOW=true
WIN=1000
STEP=500
DO_SAF=1

This file is pretty similar to the one above, but there are some options for pointing to the SFS and for the sliding window. This can be run in the same way as the SFS calculation:

$ bash scripts/ANGSD_Thetas.sh scripts/thetas_example.conf

Cool! Now we get to look at the data using the interactive Shiny graph!

Download data locally and download GFF and view in shiny

If you are using a cluster to run ANGSD, you will need to download the data to your local computer because Shiny is easier to deal with using a GUI. The data can be scp‘d over. Once you have it, you can navigate to the scripts folder in the terminal and run the Shiny application:

$ cd scripts
$ R
> library(shiny)
> runApp("shiny")

Then, load the data in the web app and wait for the graph (it’s a lot of data!). Hopefully, you get something like this:

Unknown
Unknown-2

At this zoom it doesn’t really tell us anything, but because the graph is interactive, we can restrict the viewing window to any region we want.

Later on we are going to localize our search to interesting regions of the genome.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s