The Nielsen lab has been working on a sweet program called ANGSD (Analysis of Next Generation Sequencing Data). It allows the calculation of many population genetic statistics from
.bam files (along with a reference and ancestral genome). In order to make this program easier to use and to streamline analysis, I have been working on a set of wrapper scripts called angsd-wrapper. One feature of this wrapper that I’m excited about is the ability to graph the output using the awesome R package Shiny. Here, I’m going to run through using the software using widely available data from the 1000genomes project.
$ git clone https://github.com/arundurvasula/angsd-wrapper.git 1000genomes
This code clones the angsd-wrapper repository into a folder called 1000genomes. There’s a readme in the repository, but a lot of the documentation is actually on the Github wiki page.
Next, we can download some data into the data folder:
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00097/alignment/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00099/alignment/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/phase2_reference_assembly_sequence/hs37d5.fa.gz
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/panTro4/bigZips/panTro4.fa.gz
We’re almost done grabbing the data. Now we just need to create a file that tells ANGSD where the bams are located:
Cool! And then we create a file that hold the inbreeding coefficients in a file (assumed to be 0 for this case, but really should be estimated):
Now we’re really done.
Run the software
Okay, now we get to run the software! This is where angsd-wrapper attempts to make things easier. We eventually want to calculate theta and some neutrality test statistics. First, we need to create an SFS and then we can run the thetas part of ANGSD.
In order to do this, we modify the
sfs_example.conf configuration file to point to our data and do a few other things:
Regions is set to “1:” so that we only look at chromosome 1 for now. Also, the ancestral sequence points to the human reference because ANGSD did not like how the chimp and human genomes aligned. We can run this with the following line of code:
$ bash scripts/ANGSD_SFS.sh scripts/sfs_example.conf
Once that’s finished, we should have a nice shiny file called
EUR_Derived in the results folder. This file contains the site frequency spectrum. Now that that’s over, we can run the theta calculation workflow. Similar to above, the
thetas_example.conf file is modified:
This file is pretty similar to the one above, but there are some options for pointing to the SFS and for the sliding window. This can be run in the same way as the SFS calculation:
$ bash scripts/ANGSD_Thetas.sh scripts/thetas_example.conf
Cool! Now we get to look at the data using the interactive Shiny graph!
Download data locally and download GFF and view in shiny
If you are using a cluster to run ANGSD, you will need to download the data to your local computer because Shiny is easier to deal with using a GUI. The data can be
scp‘d over. Once you have it, you can navigate to the scripts folder in the terminal and run the Shiny application:
$ cd scripts
Then, load the data in the web app and wait for the graph (it’s a lot of data!). Hopefully, you get something like this:
At this zoom it doesn’t really tell us anything, but because the graph is interactive, we can restrict the viewing window to any region we want.
Later on we are going to localize our search to interesting regions of the genome.