Genome sequencing platforms compared for bacterial de novo assemblies

Wow, I haven't blogged for ages. Partly this is the usual excuse of not having time, and partly a lack of inspiration. Sorry. Perhaps just before Xmas is the wrong time to get my mojo back, but I guess that's the way life is.

So what's been happening? Well, on the sequencing front we've recently been celebrating getting single-scaffold assemblies for bacterial genomes, a grand total of 4 in a week! This was achieved with the 454 8kb paired-end protocol and 454 WGS data. I know lots of other groups have done this, but it is very satisfying when it happens to you!

That brings me on to some results which I thought were interesting enough to share. Mike Halachev, my fellow developer on the xBASE project, was importing the latest batch of bacterial genomes deposited in GenBank and noticed that the COMMENT block often reveals the sequencing platform, coverage depth and assembler used. Needless to say, like a good bioinformatician, he decided to graph the results and see what they showed us.
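For anyone who wants to try the same trick, here is a rough sketch of pulling those fields out of a COMMENT block. This is not Mike's actual code. It assumes the record uses the "Genome-Assembly-Data" structured-comment layout with `key :: value` lines; plenty of records use free text instead, and those would need hand-written rules. The example text and the function names are made up for illustration.

```python
import re

# Hypothetical example in the structured-comment layout (an assumption:
# real COMMENT blocks vary, and many are free text).
EXAMPLE_COMMENT = """\
COMMENT     ##Genome-Assembly-Data-START##
            Assembly Method       :: Newbler v. 2.3
            Genome Coverage       :: 25x
            Sequencing Technology :: 454
            ##Genome-Assembly-Data-END##
"""

# Matches "Key :: value", optionally preceded by the COMMENT keyword.
FIELD_RE = re.compile(r"^\s*(?:COMMENT\s+)?([A-Za-z][A-Za-z \-]*?)\s*::\s*(.+?)\s*$")


def parse_assembly_comment(text):
    """Return a dict of the 'key :: value' pairs found in a COMMENT block."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def coverage_as_number(value):
    """Pull the first number out of a string such as '25x'; None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", value)
    return float(match.group()) if match else None


fields = parse_assembly_comment(EXAMPLE_COMMENT)
print(fields["Sequencing Technology"], coverage_as_number(fields["Genome Coverage"]))
```

Once each record is reduced to a platform, a coverage figure and an assembler name, counting and plotting them is the easy part.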

Firstly, incomplete bacterial genomes submitted to NCBI over the past 12 months (fig 1).

Figure 1

Out of an amazing 514 projects, the majority of people preferred to use 454 for sequencing (286), about half as many used Illumina (144) and most of the rest went for a hybrid 454/Illumina approach. SOLiD (ABI) was used almost as much as Sanger, i.e. not a lot. This is kind of what I would expect: the 454 is a good and tested platform for de novo assembly of microbial genomes. But I might have expected more Illumina deposits given that the large sequencing centres are so focused on this instrument. One likely reason is that some bacterial resequencing studies only do mapping, so the reads end up in the Sequence Read Archive, which is not covered by these data. I expect the balance to shift a little towards Illumina in the next 12 months.

Coverage for different platforms (fig 2).

Figure 2

As you might expect, Illumina assemblies have greater coverage on average (median 67x) than 454 assemblies (median 25x), reflecting the increased throughput of these instruments. SOLiD is a bit skewed by the 7 Listeria genomes submitted by Life Tech, each at >200x coverage. For 454 quite a range of coverage depths is seen, starting at 10x but going up to 200x. It's a bit of a waste of money getting that much coverage. For Illumina the range is higher and narrower, concentrating around the 60x mark.

In terms of number of contigs (fig 3) it is surprising and notable that the 454 and Illumina contig numbers are comparable despite the difference in read-length.

Figure 3

Of course 454 covers GS 20, GS FLX and Titanium read lengths, and Illumina can be run fragment or paired-end with reads from 25 to 125 base pairs, so the comparisons are not direct. I would presume most of the Illumina projects used paired-end sequencing, which produces the equivalent of ~250bp reads. The 454/Illumina hybrid assemblies are not obviously better, and some are much worse, which I think reflects the lack of a decent assembly pipeline for combining these data. The SOLiD assemblies are pretty bad, which again comes down to those Listeria Life Tech sequences. These data may be skewed by the fact that many people omit their really small contigs when depositing in GenBank. N50 would be a better measure but I don't have that information.
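For anyone who hasn't met it, N50 is the contig length at which contigs of that size or larger hold at least half of the assembled bases. Here is a minimal sketch of the calculation; the contig lengths are invented purely to show why it is less sensitive to dropped small contigs than a raw contig count.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("need at least one contig")


# Invented lengths (arbitrary units), not taken from any real assembly.
full = [100, 80, 50, 20, 10]
trimmed = [100, 80, 50]  # same assembly with the two smallest contigs left out

print(len(full), n50(full))        # 5 contigs
print(len(trimmed), n50(trimmed))  # 3 contigs
```

In this toy case the contig count drops from 5 to 3 when the small contigs are omitted, while N50 stays at 80. That won't hold exactly for every assembly, but N50 generally moves much less than the count does.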

Plotting coverage against number of contigs (fig 4) you can see a truth that is still unpalatable to some people (forgive me for not doing any linear regression here) - increasing coverage beyond a certain point (I think about 15x for 454) doesn't mean you get fewer contigs. For those raised on Sanger sequencing and Lander-Waterman statistics this is a bit of a surprise. When planning an experiment it is important to realise that the assembly will never be in fewer contigs than there are repeat regions in the genome (that is, repeats longer than the read length). It's impossible without some manual finishing or guessing against a reference. If you add in scaffolding this is still true, but contigs can be oriented and gap lengths defined.
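To see why this is a surprise, it helps to look at what Lander-Waterman predicts. In that model the expected number of contigs is N * exp(-c * sigma), where N is the number of reads, c is the coverage and sigma = 1 - T/L for read length L and minimum detectable overlap T. The sketch below is only an illustration: the genome size, read length and repeat count are made-up numbers, and taking the maximum of the model and the repeat count is my crude way of expressing the floor described above, not part of the Lander-Waterman model itself.

```python
import math


def expected_contigs(genome_size, read_length, coverage, min_overlap=0):
    """Lander-Waterman expected contig count for a repeat-free genome."""
    n_reads = coverage * genome_size / read_length
    sigma = 1 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)


def contigs_with_repeat_floor(genome_size, read_length, coverage, long_repeats):
    """Crude illustration: the contig count cannot fall below the number of
    repeats longer than the read length (long_repeats is a hypothetical input)."""
    return max(expected_contigs(genome_size, read_length, coverage), long_repeats)


# Assumed values: a 5 Mb genome, 400 bp reads, 60 long repeats.
for cov in (5, 10, 15, 25, 50):
    model = expected_contigs(5_000_000, 400, cov)
    floored = contigs_with_repeat_floor(5_000_000, 400, cov, long_repeats=60)
    print(f"{cov:>3}x  model={model:10.2f}  with repeat floor={floored:10.2f}")
```

With these assumptions the idealised model drops below one contig by about 15x, so on paper the genome is closed. Add a floor for the repeats and the curve goes flat at that point, however much more coverage you buy.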

Fig4a

Update: And for Lex Nederbragt who took the time to post in the comments, here's a log/log scale. It strikes me that a few of the genome projects labelled 'ABI' are likely Sanger, and the ones you can see in the top right are SOLiD. I'd be inclined to ignore the outliers which look like they result from mistakes when filling in NCBI's genome project submission form.

Finally, what assemblers are in use? Well, there are really only two contenders for the crown of most popular assembler for bacterial data: Newbler for 454 data (does a good job, in my experience) and Velvet for Illumina / SOLiD data. Celera is popular, but mainly at JCVI for obvious reasons. I find it interesting that few other short-read assemblers get a look in, especially as there are heaps of them.

Figure 4

Well, I hope you found that interesting, and I promise not to leave it so long for my next post!