The effect of excess reads and uneven coverage on viral sequence assembly

tl;dr: it makes it harder. Use IDBA_UD for assembly.

Code: https://github.com/arundurvasula/coverage-depth-assembly

One of the coolest results of RNA biology has been the ability to reconstruct full viral genomes from the sequencing of siRNAs found in eukaryotes. The Wu et al. 2010 paper in PNAS (http://www.pnas.org/content/107/4/1606.full.pdf) demonstrated that this was possible in fruit flies, mosquitoes, and nematodes. One of the things I’ve been working on has been applying this to virus discovery in crop plants. It’s a cool method, but one problem is that most assemblers weren’t built with this kind of data in mind. That is to say, most assemblers expect a sample with relatively even coverage across a genome.

When we sequence the siRNAs, we get massive amounts of coverage in some areas and really low levels of coverage in others. This makes it difficult for most assemblers to reconstruct a full genome. To get around this, one of the things I tried was subsampling and normalizing my reads. The idea was to even out the coverage at a lower level so that the data is more in line with what assemblers expect.
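To make the subsampling idea concrete, here is a minimal sketch of the arithmetic: work out what fraction of reads to keep so that the mean coverage lands near a target, then keep each read with that probability. The function names and the example numbers (read count, read length, genome size) are made up for illustration and are not taken from subsampling.sh.

```python
import random


def keep_fraction(target_coverage, genome_length, read_lengths):
    """Fraction of reads to keep so mean coverage is roughly target_coverage."""
    current_coverage = sum(read_lengths) / genome_length
    if current_coverage == 0:
        return 1.0
    # Never ask for more reads than we have.
    return min(1.0, target_coverage / current_coverage)


def subsample(reads, fraction, seed=42):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


# Hypothetical numbers: 100,000 reads of 21 nt on a 10,000 nt viral genome.
read_lengths = [21] * 100_000
fraction = keep_fraction(50, 10_000, read_lengths)  # 50x target, 210x available
kept = subsample(list(range(len(read_lengths))), fraction)
```

Because every read is kept with the same probability, the expected depth at every position is scaled by the same fraction.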

However, this didn’t go at all like I expected. What I did in the code above was subsample or normalize my reads down to a specific coverage level and map the reads to a virus that I know is in the sample. Below you can see what happened:

[Figure: Coverage: 50x. Method: subsampling]

[Figure: Coverage: 50x. Method: normalization]

[Figure: Coverage: 5x. Method: subsampling]

[Figure: Coverage: 5x. Method: normalization]

As you can see from the graphs, normalization produces a much smoother coverage profile, but it doesn’t really even out the coverage, so it doesn’t help assembly. Subsampling performs even worse and makes coverage look pretty terrible (especially at the 5x level). The code to do this is in the repository as normalization.sh and subsampling.sh. Normalization was done with bbmap and subsampling was done with bioawk.
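For readers who haven’t seen read normalization before, the general idea can be sketched roughly like this: count k-mers as reads stream by, and drop a read once the k-mers it contains have already been seen about as often as the target depth. This is a simplified digital-normalization-style filter written for illustration, not what normalization.sh or bbmap actually run, and the function names, `k`, and `target_depth` values are my own assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_reads(reads, k=5, target_depth=2):
    """Keep a read only while the median count of its k-mers is below target_depth."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < target_depth:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy input: one sequence seen 10 times and one seen once.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = normalize_reads(reads, k=5, target_depth=2)
```

In this toy case the abundant read is capped at two copies while the rare read is kept, which is the behaviour normalization is aiming for.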

As it turns out, the best way to deal with this is to use an assembler that can handle uneven coverage, and that’s where IDBA_UD comes in. This assembler is awesome and was designed with uneven coverage in mind. From the homepage:

IDBA-UD is a iterative De Bruijn Graph De Novo Assembler for Short Reads Sequencing data with Highly Uneven Sequencing Depth. It is an extension of IDBA algorithm. IDBA-UD also iterates from small k to a large k. In each iteration, short and low-depth contigs are removed iteratively with cutoff threshold from low to high to reduce the errors in low-depth and high-depth regions. Paired-end reads are aligned to contigs and assembled locally to generate some missing k-mers in low-depth regions. With these technologies, IDBA-UD can iterate k value of de Bruijn graph to a very large value with less gaps and less branches to form long contigs in both low-depth and high-depth regions.

This assembler has performed much better than most of the alternatives (Velvet, ABySS, CLCBio, SOAP, etc.) and I’d highly recommend it if you’ve got uneven coverage in your samples.