Sequencing low diversity libraries on Illumina MiSeq

After its launch in 2005, the 454 rapidly became the go-to technology if you wanted to sample diversity in amplicon libraries, whether a cancer panel, a viral quasispecies or microbial community profiling. It is not difficult to see why. Compared to Sanger sequencing, the 454 offered massive throughput, being able to produce over a million reads per run at the relatively modest price of $10,000. This was an order of magnitude less than Sanger sequencing. And crucially, combining the instrument's high throughput with barcode multiplexing permitted large numbers of samples to be interrogated on a single run at high coverage depth.

In microbiology and ecology, deep sequencing of 16S amplicon libraries using 454 is now the dominant method for phylogenetic profiling of microbes. Of the 2,210 publications listed on the 454.com website, 839 are in the category “Metagenomics and Microbial Diversity”. The “rare biosphere” in our bodies in health and disease was revealed for the first time. Environmental ecologists used the technology to interrogate hugely diverse environmental niches, uncovering hundreds of new OTUs, many of them representing hitherto uncultured microbes.

Move over 454

Sadly, the pace of development of the 454 platform has stagnated in recent years following the Titanium upgrade in 2008. The long-promised upgrade to GS FLX+ “1kb reads” was late and under-delivered, with reads more like 700-800 bases, and some users have reported dissatisfaction with the upgrade. Disappointingly, the long read protocol is not supported when running unidirectional Lib-A sequencing, dramatically limiting its potential market. Nor is it available on the benchtop 454 GS Junior, although this may change in future.

But most critical is the apparent blind spot of Roche management to the rapidly dropping costs of sequencing on competitor platforms. The 454 has simply priced itself out of the market by being one to two orders of magnitude more expensive when costed per megabase compared to the Illumina and Life Technologies platforms.

New Platforms for Amplicon Sequencing

So for microbiologists wishing to do 16S sequencing, whether they are driven by cost-cutting, or by a desire to sequence more samples more deeply, it is now time to look around at alternatives. The MiSeq and the PGM are both promising platforms for 16S analysis given their competitive price points, and increasingly long reads (MiSeq 2x150bp, PGM 200bp - going to 2x250bp and 400bp respectively by the end of the year).

Sequencing low diversity libraries on Illumina MiSeq

We are moving to the Illumina MiSeq locally for 16S sequencing. For about £750 we generate over 5 million reads per run. By using paired-end sequencing at 150 bases we can design experiments which generate amplicons a little less than 300 bases and overlap them to generate long pseudo-reads. The error model is favourable compared to 454 as it does not suffer from frequent indel errors, meaning there is less need for expensive denoising steps such as PyroNoise.
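To make the overlap idea concrete, here is a minimal Python sketch. It assumes error-free reads and an exact-match overlap, and the helper names (`reverse_complement`, `merge_pair`) are hypothetical. Real overlapping tools tolerate mismatches and use base qualities, which this does not attempt.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read and its reverse mate into one pseudo-read.

    read2 is given as sequenced (i.e. from the opposite strand), so it is
    reverse-complemented first. Returns None if no exact overlap of at
    least min_overlap bases is found. Assumes error-free reads.
    """
    mate = reverse_complement(read2)
    # Try the longest possible overlap first and shrink from there.
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


if __name__ == "__main__":
    # Toy 40-base "amplicon" read with 2x25 bases, giving a 10-base overlap.
    amplicon = "ACGTTGCAAGCTTAGGCTAACCGTATCGGATCCATGCAAT"
    r1 = amplicon[:25]
    r2 = reverse_complement(amplicon[15:])
    print(merge_pair(r1, r2) == amplicon)
```

The same logic scales to 2x150 reads on an amplicon a little under 300 bases: the shorter the amplicon, the longer the overlap available to anchor the merge.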

However, there is a fly in the ointment. Amplicon sequencing on the Illumina platform has traditionally been problematic when sequencing so-called "low diversity" libraries such as 16S, resulting in low yields and lower per-base quality scores compared to sequencing more random libraries, e.g. from genomic DNA.

The good folks of Seqanswers have discussed this at length, and various work-arounds have been suggested. One commonly used approach is to spike in a genomic, higher-diversity sample, e.g. PhiX. The more PhiX spiked in, the better the results, but at the expense of the number of amplicon sequences generated. A second option is to add a sequence of N bases upstream of the 16S primer, resulting in the generation of random sequences. This however reduces the effective read length.
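The cost of the PhiX work-around is easy to estimate. The sketch below assumes, as a simplification, that reads are split in proportion to the spike-in fraction, and uses a round 5 million reads per run for illustration. The function name `amplicon_reads` is hypothetical.

```python
def amplicon_reads(total_reads, phix_fraction):
    """Estimate reads left for amplicons after a PhiX spike-in.

    Assumes reads are shared in proportion to the spike-in fraction,
    which is a simplification of what happens on the flow cell.
    """
    if not 0.0 <= phix_fraction <= 1.0:
        raise ValueError("phix_fraction must be between 0 and 1")
    return int(round(total_reads * (1.0 - phix_fraction)))


if __name__ == "__main__":
    for fraction in (0.05, 0.25, 0.50):
        print(fraction, amplicon_reads(5_000_000, fraction))
```

Under these assumptions a small spike-in costs very little, while a 50% spike-in halves the number of amplicon reads per run.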

Solving the problem

We have been very fortunate in the past few weeks to welcome Josh Quick to our lab. He previously worked as an integration engineer at Illumina but has now decided to hone his skills as a bioinformatician. There's not much he doesn't know about Illumina sequencing, and he quickly introduced me to some tips for improving amplicon sequencing performance that were so impressive I asked him to share them here.

Over to you, Josh ...

There are 3 main areas in which low-diversity samples can cause you problems on MiSeq:

1) Focusing (every cycle) - the MiSeq focuses on the T channel with a fallback to the C channel. In practice, as long as the signal is not all in the G channel you will be fine.  All other issues aside, a very small PhiX spike-in (~5%) is enough to prevent any focusing issues regardless of the composition of your library.

2) Template building (cycles 1 to 4) and registration (every cycle) - RTA uses images from the first 4 cycles to detect the positions of all the clusters.  You need to have some signal present in each channel for RTA to do template generation and registration properly.  Again, a small PhiX spike-in (~5%) is usually enough to prevent problems here, provided density is <=700k.

3) Phasing/matrix estimation (cycles 1 - 12) - RTA estimates the average colour matrix over the first 4 cycles and the phasing over the first 12 cycles.  Low diversity samples can cause problems with both, as the intensity is not evenly distributed across all channels as it is with genomic libraries.  Because these estimates are used to perform corrections, a bad estimate here can cause your quality to start high and then rapidly fall away. In these cases you might need a large PhiX spike-in (~50%) to solve the problem.
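To get a feel for how unbalanced a library is over these early cycles, you can tabulate the base composition per cycle from reads of an earlier run or from the expected amplicon sequences. This is a minimal Python sketch with a hypothetical helper, `base_composition`. It looks at called bases only and is not how RTA itself measures intensities.

```python
from collections import Counter


def base_composition(reads, n_cycles=12):
    """Return one dict per cycle giving the fraction of A, C, G and T.

    Reads shorter than a given cycle are skipped for that cycle, and
    non-ACGT calls (e.g. N) are ignored when computing fractions.
    """
    table = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads if len(read) > cycle)
        total = sum(counts[base] for base in "ACGT")
        table.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return table


if __name__ == "__main__":
    # Toy amplicon-like reads: identical primer region, variable last base.
    reads = ["ACGT", "ACGA", "ACGC", "ACGG"]
    for cycle, fractions in enumerate(base_composition(reads, n_cycles=4), 1):
        print(cycle, fractions)
```

In the toy example the first three cycles are 100% a single base, which is the situation described above, while the fourth cycle is evenly balanced like a genomic library.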

Control lanes - on the GA/HiSeq, 1) and 2) were still considerations (although each instrument focuses differently); however, the use of a PhiX control lane eliminated problem 3).  On the MiSeq, having only a single lane means a control lane isn’t possible, but there is a method for using ‘control’ conditions on MiSeq by modifying the RTA configuration file.

In my experience the most likely thing to go wrong is the phasing estimator. It will give a spuriously high phasing or prephasing number of >1%, which means your quality starts off good then rapidly falls away.  However, you can use a value based on a previous PhiX run; for example, ours would be 0.0015/0.003.
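To see why a figure above 1% is so damaging, consider a deliberately simplified model (an assumption for illustration, not RTA's actual algorithm): if a fixed fraction of molecules in a cluster falls out of step at every cycle, the fraction still in phase after n cycles is roughly (1 - phasing - prephasing)^n. The function name `in_phase_fraction` and the 1%/1% "spurious" values are illustrative.

```python
def in_phase_fraction(phasing, prephasing, cycles):
    """Fraction of molecules still in step after `cycles` cycles.

    Simplified model: a constant fraction is lost to phasing and
    prephasing at every cycle, with no recovery.
    """
    return (1.0 - phasing - prephasing) ** cycles


if __name__ == "__main__":
    examples = [
        ("PhiX-derived (0.0015/0.003)", 0.0015, 0.003),
        ("spurious estimate (0.01/0.01)", 0.01, 0.01),
    ]
    for label, phasing, prephasing in examples:
        print(label, round(in_phase_fraction(phasing, prephasing, 150), 3))
```

Under this model the PhiX-derived values imply roughly half the signal is still in phase at cycle 150, whereas the spurious estimate implies only around 5%. A correction built on the second figure grows cycle by cycle, which is consistent with quality that starts well and then falls away.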

The way to use ‘control’ matrix/phasing on the MiSeq:

(DISCLAIMER - this is not a configuration supported by Illumina so use it at your own risk)

Our MiSeq is running:

Locate your RTA configuration, ours is at:

C:\Illumina\RTA\Configs\MiSeq.Configuration.xml

Locate your control phasing and matrix files (previous PhiX run is ideal):

D:\Illumina\MiSeqTemp\RunFolder\Data\Intensities\BaseCalls\(Phasing|Matrix)\s_1(phasing|matrix).txt

Use a text editor to put the matrix and phasing values from these files into the MiSeq.Configuration.xml below the other options like this:

<HardCodedPhasing>
  <float>0.0015</float>
</HardCodedPhasing>
<HardCodedPrePhasing>
  <float>0.003</float>
</HardCodedPrePhasing>
<HardCodedColorMatrix>
  <ArrayOfFloat>
    <float>0.9339278</float>
    <float>0.07252103</float>
    <float>0</float>
    <float>0</float>
    <float>1.458246</float>
    <float>1.399187</float>
    <float>0</float>
    <float>0</float>
    <float>0</float>
    <float>0</float>
    <float>0.8679092</float>
    <float>0.03415901</float>
    <float>0</float>
    <float>0</float>
    <float>0.5764247</float>
    <float>0.988043</float>
  </ArrayOfFloat>
</HardCodedColorMatrix>

You need to have one float/ArrayOfFloat per read, so the above would set the phasing, prephasing and matrix for a single-read run, and the following would set just the phasing for a dual-index paired-end run with four reads:

<HardCodedPhasing>
  <float>0.0015</float>
  <float>0.0015</float>
  <float>0.0015</float>
  <float>0.0015</float>
</HardCodedPhasing>
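Typing these blocks by hand is error-prone, so here is a small Python sketch that generates them with one entry per read. The helper names (`float_block`, `matrix_block`) are hypothetical, the check for 16 values per matrix is inferred from the example above, and the output still has to be pasted into MiSeq.Configuration.xml by hand.

```python
def float_block(tag, values):
    """Build <tag> containing one <float> per read."""
    lines = ["<%s>" % tag]
    lines += ["  <float>%s</float>" % value for value in values]
    lines.append("</%s>" % tag)
    return "\n".join(lines)


def matrix_block(matrices):
    """Build <HardCodedColorMatrix> with one <ArrayOfFloat> per read.

    Each matrix is a flat list of 16 values, matching the example in
    the post (assumed here to be a 4x4 colour matrix).
    """
    lines = ["<HardCodedColorMatrix>"]
    for matrix in matrices:
        if len(matrix) != 16:
            raise ValueError("expected 16 values per matrix, got %d" % len(matrix))
        lines.append("  <ArrayOfFloat>")
        lines += ["    <float>%s</float>" % value for value in matrix]
        lines.append("  </ArrayOfFloat>")
    lines.append("</HardCodedColorMatrix>")
    return "\n".join(lines)


if __name__ == "__main__":
    n_reads = 4  # dual-index paired-end run: two reads plus two index reads
    print(float_block("HardCodedPhasing", ["0.0015"] * n_reads))
    print(float_block("HardCodedPrePhasing", ["0.003"] * n_reads))
```

Passing the values as strings copied from your own phasing and matrix files avoids any change in how the numbers are written out.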

When the run starts, check the RTA configuration file in your run folder to make sure it accepted the settings:

D:\Illumina\MiSeqTemp\RunFolder\Data\Intensities\RTAConfiguration.xml
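This check can be scripted. The following Python sketch searches the XML for a given element and returns its float values. It makes no assumption about where the element sits in the file, since the overall layout of RTAConfiguration.xml is not described here, and the function name `hard_coded_values` is hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip any XML namespace prefix of the form {uri}name."""
    return tag.rsplit("}", 1)[-1]


def hard_coded_values(xml_text, tag):
    """Return the <float> values found under the first element named `tag`.

    Returns None if the element is absent, which would suggest the
    settings were not picked up.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if _local(element.tag) == tag:
            return [
                float(child.text)
                for child in element.iter()
                if _local(child.tag) == "float"
            ]
    return None


if __name__ == "__main__":
    # Stand-in document; the real file will contain many other settings.
    sample = """<Config>
      <HardCodedPhasing><float>0.0015</float><float>0.0015</float></HardCodedPhasing>
    </Config>"""
    print(hard_coded_values(sample, "HardCodedPhasing"))
    print(hard_coded_values(sample, "HardCodedPrePhasing"))
```

To use it on a real run, read the contents of the RTAConfiguration.xml from your run folder and pass them in as `xml_text`, then compare the returned values against the ones you entered.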

In most cases this will enable you to use a significantly smaller amount of spiked-in PhiX. You will still need a minimum of 5% to prevent problems arising from 1) and 2), and you should not run at high density for amplicon work: 700k is the upper limit for difficult low diversity samples.  It is also possible to save the images for re-running RTA offline, which lets you try different settings to find what works best.  The MiSeq.Configuration.xml setting for this is:

<CopyImages>true</CopyImages>

Good luck!

Update 6th September 2012: Some of the example values in the original post were wrong and have been corrected. However, these were only illustrative; for this approach to be useful you should use the values from a test run on your local machine.

Update 31st October 2012: In the latest release of RTA (1.16) you no longer need to modify your RTAConfiguration.xml. Instead, save a copy of the control phasing/matrix files described above in the root of the RTA directory as phasing.txt and matrix.txt. RTA will fall back to the values in these files if it detects a low diversity sample.