Ion Torrent: What is the impact of the new longer reads on assembly?

Ion Torrent have released a set of longer read 314 data, along with this technical note.

(Graphic from PRINSEQ)

The reads are indeed much longer than we have seen with our previous 316 runs, with a mean of 223bp and longest read being 398bp. Curiously this longer-read protocol has been done on a 314 chip rather than a 316 but Life inform me that it should work on the other chip just as well.

I thought it might be informative to assess how well this dataset performs in a real-world example, in this case de novo assembly.

The first question I wanted to look at is:

How useful are these long reads in improving assemblies?

I ran the SFF files from Life Tech's long-read 314 chip run (B14_387) and their 316 chip run (B13_328). Both data files are from the standard E. coli K-12 DH10B library.

For assembly I decided to use Newbler 2.6, run under default settings. I use Newbler because it runs fairly quickly and I'm impatient. For "real work" I tend to use MIRA.

[table id=1 /]

Above is an A50 plot. A50 plots are a way of visualising how contiguous and complete an assembly is. They are described in more detail at the great BioHaskell blog. Suffice to say, the steeper the slope, the better the assembly is.

So, in fact, the long-read 314 data set doesn't give as good an assembly (judged by metrics) as the short-read 316 data.

Is this due to the much lower depth of coverage (~14x vs 36x), or the increased error rate seen in the 314 long-read data(2.8% vs 1.5%) ?

What is the effect of depth of coverage on Ion Torrent assemblies?

Sub-sampling the larger 316 dataset to generate reads approximating different depths of coverage from 10x to 35x (assuming a 4.7Mb genome) we can test the effect of reducing coverage levels on the higher quality 316 data. This gives the following A50 curves:

[table id=2 /]

Actual coverage is the number of aligned bases in the assembly divided by the theoretical size of the genome (4.7Mb) and thus represents bases that made it through to the assembly rather than simply the number of bases in the input.

So it seems that the longer read set perform worse in assembly than the shorter reads, at nearly identical depth of coverage.

Higher error rate can be seen to have a deleterious effect on the percentage aligned bases (96 vs 90%) and negatively affects the resulting assembly.

My conclusion is that the higher error rate seen in the long-read data results in this less contiguous assembly, ameliorating the effects of the longer read. It is possible that Newbler is clipping the low quality tails from the data. I would like to repeat this analysis with MIRA.

However, I suspect at a much higher coverage level these effects will start to melt away and the assemblies will improve significantly compared to the shorter data. It would be nice to have some more long-read data to throw into the mix to test this prediction.

Coverage over a certain threshold does not improve contig statistics

Another finding here is that although the 20x coverage genome looks significantly better than the 15x genome there is not much improvement in the assembly metrics when going up from 20x to 25 - 35x.

This is down to repeats in the E. coli genome. When there are repeats in the genome longer than the read length, this becomes the bottleneck for getting longer contigs. The only way out of this problem is to get longer reads, or to use a mate-pair protocol. Piling a load more reads will not make much difference to the assembly at this point.

(This is why it is a bit pointless to do what many do and add a huge pile of short reads to a 454 assembly unless your specific aim is to correct indel errors from homopolymers).

Shooting for 20x coverage means that you could aim to sequence perhaps two or three 4Mb genomes per 316 chip (assuming a yield of 200-250Mb which is what we've seen) or one 4Mb genome on a very good 314 chip (now priced at $99).

In the next blog post I will look more carefully at the resulting assemblies to assess quality at different coverage levels, and also compare with the Illumina HiSeq data.

Update 10th August 2011

Thanks for all your comments. This data set does seem curious.

I wanted to just check that these results weren't just Newbler 2.6 being weird, so I also assembled just the 314-long dataset with MIRA 3.2.1.17_dev and CLC Genomics Workbench 4.7.2. The results were pretty meh;

[table id=3 /]

I feel I should add - just in case there is any confusion - that this is not a criticism of either assembler which usually give similar results to Newbler when run on a regular dataset.

A reader, aeonsim asked in the comments what happens if you combine the long and short read dataset. Actually the long read set when added to the 316 data worsens the assembly, despite increasing the coverage.

[table id=4 /]

Lex Nederbragt has verified my findings, and had a closer look at the reads and although he finds signal suggestive of homopolymer errors towards the ends of the reads, this does not fully explain why the reads don't assemble well. He compares to a 454 dataset and finds the Ion Torrent reads lacking.

--

Comments, criticisms as always welcome. Please post in the comments below rather than by personal e-mail if your organisation allows it.