An outsider's guide to bacterial genome sequencing on the Pacific Biosciences RS

It had to happen eventually. My Twitter feed in recent times had become unbearable with the insufferably smug PacBio mafia (that's you Keith, Lex, Adam and David) crowing about their PacBio completed bacterial genomes. So, if you can't beat 'em, join 'em. Right now we have a couple of bacterial genomic epidemiology projects that would benefit from a complete reference genome. In these cases the available reference genomes differ from our strains in gene content and gene organisation, and are divergent enough in nucleotide identity that we worry about how much of the genome is truly mappable. And in terms of presenting the work for publication, there is a certain aesthetic appeal to having a complete genome.

And so, after several false starts relating to getting the strains selected and enough pure DNA isolated, we finally sent some DNA off to Lex Nederbragt at the end of last year for Pacific Biosciences RS (PacBio) sequencing!

This week I received the data back, and I thought it would be interesting to document a few things I have learnt about PacBio sequencing during this intriguing process.

It’s all about the library, stupid

The early PacBio publications in 2011 showing de novo assemblies of PacBio data weren't great, giving N50 results not materially better than 454 sequencing. This was despite the vastly longer read lengths achieved by the instrument: even then, mean read lengths of 2kb were achievable. Since then, incremental improvements to all aspects of the sequencing workflow have resulted in dramatic improvements to assembly performance, such that single-contig bacterial genome assemblies can now be expected routinely. This is probably best illustrated by the paper published last year describing HGAP (PacBio's assembly pipeline), which demonstrated single-contig assemblies for three different bacterial species including E. coli.

The main improvements have been to library preparation, the sequencing chemistries and the assembly software itself - some observations on each follow.

You need a lot of DNA (like, a lot)

There is a trade-off between the amount of input DNA and which library prep you can do: 5 micrograms for a 10kb library, and ideally 10 micrograms for a 20kb library. That is not always a trivial amount to get your hands on, even for fast-growing bacteria. This is one of the things that would limit our use of PacBio for metagenomics pathogen discovery right now, because this amount of microbial DNA from a culture-free sample is basically impossible to get.

In fact we managed to get a library made from 2.6ug of DNA, but in this case the BluePippin size cut-off had to be dropped to 4kb (from 7kb).

Input DNA   Library prep   BluePippin cut-off   Number of reads   Average read length   Mbases
2.6ug       10kb           4kb                  36,696            5,529                 202.9
2.6ug       10kb           4kb                  70,123            5,125                 359.4
>5ug        10kb           7kb                  49,970            6,898                 334.7
>5ug        10kb           7kb                  58,755            6,597                 387.6
>10ug       20kb           7kb                  42,431            6,829                 289.9
>10ug       20kb           7kb                  59,156            7,093                 419.6

Table 1. PacBio library construction parameters and accompanying output statistics (one row per SMRTcell, two SMRTcells per library, 180 minute movies)
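To put those yields in context, here is a quick back-of-the-envelope Python sketch. The per-cell yields come from Table 1 and the roughly 5 Mb genome size from the text; the halved-yield scenario at the end is purely illustrative:

```python
import math

# Rough coverage arithmetic for the SMRTcell yields in Table 1.
GENOME_SIZE_MB = 5.0  # approximate genome size in megabases (from the text)

# Per-SMRTcell yields in Mbases (10kb library, 7kb cut-off rows of Table 1)
yields_mb = [334.7, 387.6]

coverage_per_cell = [y / GENOME_SIZE_MB for y in yields_mb]
total_coverage = sum(coverage_per_cell)

print(f"Coverage per cell: {coverage_per_cell[0]:.0f}x and {coverage_per_cell[1]:.0f}x")
print(f"Total from two cells: {total_coverage:.0f}x")  # comfortably over 100x

# Hypothetical: how many cells for 100x if poor loading halved the yield?
reduced_yield = min(yields_mb) / 2
cells_needed = math.ceil(100 * GENOME_SIZE_MB / reduced_yield)
print(f"Cells for 100x at half yield: {cells_needed}")
```

On these numbers two SMRTcells already give well over 100x coverage, which is why warnings about needing three cells (see below) did not apply to us.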

(As an aside, I wonder whether the shorter-than-expected Oxford Nanopore read lengths reported in David Jaffe's AGBT talk may simply be a question of feeding the right length fragments into the workflow. Short fragments in equals short reads out.)

Choose the right polymerase

For bacterial de novo assemblies, all the experts I spoke to recommended the P4-C2 enzyme. This polymerase doesn't generate the very longest reads, but is recommended for de novo assembly because the newer P5-C3 chemistry has systematic error issues with homopolymers (as presented by Keith Robison at AGBT). P5-C3 is therefore recommended for scaffolding assemblies, or could be used in conjunction with P4-C2.

Longer reads may mean reduced loading efficiency

You want long fragments, but we were warned by several centres that 20kb libraries load less efficiently than 10kb libraries, reducing throughput. It was suggested we would need 3 SMRTcells to get 100x coverage of an E. coli-sized genome of 5 Mb. In our case, however, that didn't turn out to be true (Table 1).

Shop around for best prices

As you almost certainly can't afford your own PacBio, and even if you could your floor wouldn't support its weight, you will probably be using an external provider like I did. Prices vary, but the quotes I had were around £350 per SMRTcell and around £400 for a 10kb library prep, with 20kb libraries being more expensive. In the end I went with Lex Nederbragt and the Oslo CEES - not the very cheapest, but I know and trust Lex not to screw up my project and to communicate well, an important consideration (see Mick Watson's guide to choosing a sequencing provider). In the UK, the University of Liverpool CGR have just acquired a PacBio and would also be worth a try. TGAC also provide PacBio sequencing. In the US, Duke provide a useful online quote generator and the prices seem keen.

What language you speaking?

It's both refreshing and a bit unnerving to be doing something as familiar as bacterial assembly while having to wrap your head around a bunch of new nomenclature. Specifically, the terms you need to understand include:

SMRTcell - the consumable unit of sequencing; each cell contains the zero-mode waveguides in which sequencing takes place
Movie - the recording of a sequencing run; longer movies give longer reads (ours were 180 minutes)
Polymerase read - the entire read generated by a single polymerase, which may pass around a circular SMRTbell template multiple times
Subread - the stretch of a polymerase read between two adapters; the subreads are what go into the assembly
Pre-assembled reads - the long, error-corrected reads produced by the first stage of HGAP, which seed the draft assembly

The key parameter for HGAP assembly is the Mean Seed Cutoff

[caption id="attachment_2014" align="alignnone" width="1024"]The seed length cutoff is the set of longest reads which give >30x coverage[/caption]

This parameter is critical: it defines how many of the longer, corrected reads go into the draft Celera assembly. The default is to take the longest reads in the dataset that together give 30x coverage, calculated from the genome size you specify - which ideally you would know in advance. If the computed cutoff drops below 6000 then 6000 will be used instead. You can also specify the seed cutoff manually. According to Jason Chin, the trade-off here is simply the time taken to correct the reads versus the coverage going into the assembly; I am not clear whether there is also any quality trade-off. Tuning this value did seem to make important differences to the assembly (a lower cut-off gave better results). The HGAP2 tutorial (for it is this version you want) is helpful on tuneable parameters for assembly.
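To make the rule concrete, here is a minimal sketch of how a seed length cutoff can be derived from a set of read lengths. This is my own illustrative re-implementation of the rule as described above, not HGAP's actual code; the function name and toy numbers are invented:

```python
# Sketch: derive a seed length cutoff in the style of HGAP.
# The cutoff is the length of the shortest read in the set of longest
# reads whose combined length reaches the target coverage of the genome.

def seed_length_cutoff(read_lengths, genome_size, target_coverage=30, floor=6000):
    """Illustrative seed cutoff: longest reads giving target_coverage x genome_size."""
    target_bases = target_coverage * genome_size
    total = 0
    cutoff = floor
    for length in sorted(read_lengths, reverse=True):
        total += length
        cutoff = length
        if total >= target_bases:
            break
    return max(cutoff, floor)  # fall back to the 6000-base floor if lower

# Toy example: a 5kb "genome", aiming for 10x from the longest reads
reads = [12000, 11000, 10000, 9000, 8000, 7000, 3000, 2000]
print(seed_length_cutoff(reads, genome_size=5000, target_coverage=10))  # → 8000
```

The effect of lowering the cutoff manually is simply to admit more (shorter) seed reads into correction, at the cost of correction time - consistent with Jason Chin's description of the trade-off.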

SMRTportal is cool, but flakey

I used Amazon m2.2xlarge (32 GB RAM) instances with the latest SMRTportal 2.1.0 AMI. About half the assembly jobs I started failed, with different errors, despite my doing the same thing each time; sometimes a job would succeed with the same settings. I am not sure why this should be. Maybe my VM was dodgy.

HGAP is slooooooow

Being used to smashing out a Velvet assembly in a matter of minutes, I found the glacial speed of the HGAP pipeline a bit of a shock. On the Amazon AMI instance, assemblies were taking well over 24 hours. According to Keith Robison on Twitter, this is because much of the pipeline is single-threaded, with multi-threading only occurring on a per-contig basis - so if you are expecting a single contig, you are bottlenecked on a single processor. We therefore chose the m2.2xlarge instance type, because the high-memory instances have the fastest serial performance of the available instance types. Speed actually matters in a clinical context. Gene Myers (yes, THAT Gene Myers) presented at AGBT 2014 to say that he has a new assembler which can do an E. coli-sized genome in 30 minutes; it can't come soon enough as far as I'm concerned.

Single contig assemblies are cool

[caption id="attachment_2008" align="alignnone" width="874"]A very long contig, yesterday[/caption]

Well, my first few attempts have given me two contigs, but that is cool enough. And it is pretty damn cool. If money were no object (and locally we are looking at a 20:1 cost ratio for PacBio sequencing over Illumina) then I would get them every time. As it is, for now, we will probably confine our use to when we really need to generate a quality reference sequence to map Illumina reads against, for example when investigating an outbreak without a good existing reference. Open pan-genome species like Pseudomonas and E. coli, where you have a reasonable expectation of large-scale genome differences between unrelated genomes, are good potential applications for this technology. Our Pseudomonas genomes went from 1000 contigs to 2 contigs, which makes a huge difference to alignments. As far as I can see it is pointless to use PacBio for monomorphic organisms, unless you are interested in the methylation patterns. Keith Robison wrote recently and eloquently predicting the demise of draft genome sequencing, but whilst the price differential remains I think this is premature.

Polished assemblies still need fixing

Inspecting the alignment of Illumina reads back to the polished assemblies reveals that errors remain; these are typically single-base insertions relative to the reference, which need further polishing (Torsten Seemann's Nesoni would be a good choice for this).

The Norwegian Sequencing Centre rocks

I'm very grateful to Lex Nederbragt, Ave Tooming-Klunderud and the rest of the staff of the Norwegian Sequencing Centre in Oslo for their help with our projects; they have been very helpful and I recommend them highly. Send them your samples and first born!

Also, many thanks to those on Twitter who have answered my stupid questions about PacBio, particularly Keith Robison, Jason Chin, Adam Phillippy and Torsten Seemann.

When I have more time I will dig into the assemblies produced and look a bit more at what they mean for both the bioinformatics analysis and the biology.