Ion Torrent data blog post; a week is a long time in genomics

My last blog post, looking at de novo assembly with Ion Torrent, had over 2,000 views (a big number in these parts), which shows how much interest there is in Ion Torrent data.

Despite my protestations that this was not an assembly software comparison, a few people took it that way. I guess that's natural. But my main objective was to provide some initial results from assembling Ion Torrent data. Clearly picking a single assembler at random would have been bad practice.

CLC Bio wrote a press release saying they had the fastest and best assembler. To be fair, they did ask me if they could quote my blog and I said sure, as long as they linked to it (I'd say the same to anyone). I respect their right to market their product; it's a tough world out there.

I've had some private discussions with people who have made a few good points about genome assembly.

One point was that a large N50 value does not tell you everything about assembly quality. I completely agree and there is always a risk that improved statistics come at the expense of accuracy.
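For anyone unsure what the N50 actually measures, here is a minimal Python sketch using the usual definition: the length of the contig at which the running total, counted from the longest contig downwards, first reaches half of the total assembled bases. The function name and the toy contig lengths are my own illustration, not output from any of the assemblers discussed.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that takes the running total to at least
        # half of all assembled bases defines the N50.
        if running >= half_total:
            return length
    return 0


# Toy example (made-up lengths): same total bases, very different N50.
fragmented = [10_000, 10_000, 10_000, 10_000]
joined = [40_000]
print(n50(fragmented), n50(joined))
```

The second toy assembly scores four times higher, but the number alone cannot say whether those joins are correct, which is exactly why the N50 should not be read as a measure of accuracy.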

Another issue is that N50s can be improved by changing assembly parameters to relax stringency. However, I suspect users will not be inclined to do extensive parameter scans by hand when trialling assemblers.

One can expect N50s to improve with depth of coverage greater than 10x. Ideally I'd want >15x coverage (20x would be great) on real-world data. We should have our own reads soon, so we can test this out.
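As a rough guide to what those coverage figures mean in practice, mean depth is just total sequenced bases divided by genome size. The sketch below works that through in Python. The genome size and read length are illustrative assumptions of mine, not figures from the test dataset.

```python
import math


def mean_coverage(read_lengths, genome_size):
    """Mean depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def reads_needed(target_coverage, genome_size, mean_read_length):
    """Smallest number of reads giving at least the target mean coverage."""
    if mean_read_length <= 0:
        raise ValueError("mean_read_length must be positive")
    return math.ceil(target_coverage * genome_size / mean_read_length)


# Assumed values for illustration only: a 4.6 Mb bacterial genome
# and a mean read length of 100 bases.
GENOME_SIZE = 4_600_000
MEAN_READ_LENGTH = 100

for target in (10, 15, 20):
    print(target, reads_needed(target, GENOME_SIZE, MEAN_READ_LENGTH))
```

This is only the average: real coverage is uneven, so some regions will sit well below the mean even when the headline figure looks comfortable.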

An extremely impressive reaction to the blog was from Bastien Chevreux, author of MIRA. He maintains this free (and excellent) assembler in his spare time. He spent the weekend knocking up an update with explicit Ion Torrent support. Version 3.2.1.17 ramps the contig N50 up to nearly 10kb, which is a big improvement. He also has some early support for PacBio, another great achievement.

Lex Nederbragt helped me debug the issue that made Newbler take days to run. It seemed to relate to using the '-rip' setting; Newbler is not nearly as slow with this option omitted.

"SeqWiz" asked me why I had neglected to trial Geneious - no particular reason, I just forgot it. I've added it in and the early results look good.

My next update will look at the accuracy of Ion Torrent assemblies including the important issue of homopolymers.

PS A few people have been having problems getting their hands on the test dataset. I spoke to Life Tech about it and they said they were vetting applications for the community site, so you are much more likely to get access if you sign up with an academic email address. If you don't get access in a day or two, I'd drop them a line telling them who you are and why you want the data.