Curious results from 454 amplicon processing

So, I thought this was just about interesting enough to post here, but look away if you aren't a hardcore 454 data nerd.

I haven't done any 454 data handling off the instrument for a little while, but had to do some yesterday for a 16S amplicon run. The results were... interesting.

Analysis #1: runAnalysisPipeAmplicons

We're currently running version 2.5.3 of the Roche utilities. Here are the results of a full plate of 16S amplicon sequencing, processed with runAnalysisPipeAmplicons on default settings.

37mb of data! WTF!

We have 1.3m wells with beads in them, and 1.23m of these are detected as having the key sequence "tcag" on them. So far so good. But ~75% of these are thrown away by the short quality filter. Combined with the other filters, that leaves only 95k usable reads in the dataset, less than 10% of the sequenced beads! Pretty bad. Note the large peak around 100 base pairs: this is primer-dimer, or more accurately primer-dimer-and-a-bit-of-16S-sequence.
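For a sense of scale, the numbers above work out like this (a quick sanity check of the yield, using only the figures quoted in this post):

```python
# Yield arithmetic for Analysis #1, using the numbers quoted above.
beaded_wells = 1_300_000   # wells with beads
key_pass     = 1_230_000   # wells where the "tcag" key was detected
usable_reads = 95_000      # reads surviving all filters

short_filter_loss = 0.75   # ~75% of key-pass reads lost to the short quality filter

left_after_short_filter = key_pass * (1 - short_filter_loss)
usable_fraction = usable_reads / beaded_wells

print(f"left after short quality filter: ~{left_after_short_filter:,.0f}")
print(f"usable fraction of sequenced beads: {usable_fraction:.1%}")
```

That comes out at roughly 307k reads surviving the short quality filter alone, and a usable fraction of about 7.3% of the sequenced beads, consistent with the "less than 10%" above.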

So, I tried again with the whole-genome shotgun pipeline.

Analysis #2: runAnalysisPipe

Although you might expect the amplicon pipeline to work best with amplicon data, in our experience the whole-genome shotgun pipeline usually gives more useful results, and that was the case here. There are some major differences in the signal-processing and base-calling parameters, which produce this rather different result:

Better, but by no means fantastic: 769k usable reads comprising 206mb of data. As you can see from the read-length distribution, the reads vary enormously in size, because they are trimmed back at low-quality flows.

Can we do better? A tip-off on the ever-useful Seqanswers.com suggests using the amplicon pipeline, but changing the settings on the valley filter. The valley filter looks at where flow intensities drop off and trims or discards the read as appropriate. Intensities tend to drop off towards the end of the read, because the DNA strands on a bead gradually become "out of phase" and thus increasingly noisy over successive flows.
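To picture why the intensities droop, here is a toy simulation (my own illustration, not Roche's model): assume a small fixed fraction of strands on a bead falls out of phase at every flow, so the in-phase signal decays geometrically.

```python
# Toy model of signal droop from phasing: at each flow, a small fraction of
# strands on the bead falls out of phase, diluting the in-phase signal.
# The per-flow loss rate is an invented illustrative number, not a measured one.
def in_phase_signal(n_flows, per_flow_loss=0.002):
    signal, trace = 1.0, []
    for _ in range(n_flows):
        trace.append(signal)
        signal *= (1 - per_flow_loss)
    return trace

trace = in_phase_signal(800)   # a Titanium run delivers 800 nucleotide flows
print(f"flow 100: {trace[100]:.2f}, flow 400: {trace[400]:.2f}, flow 700: {trace[700]:.2f}")
```

The exact numbers are meaningless, but the shape is the point: the in-phase signal only ever goes down, so late flows sit ever closer to the noise floor.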

The valley-filter setting recommended for change is "vfScanAllFlows". This can be done by dumping an XML template for amplicons (gsRunProcessor --template=filterOnlyAmplicons > template.xml), then editing template.xml and setting the <vfScanAllFlows> element to read "false" instead of "tiOnly". Then re-run the pipeline with runAnalysisFilter --pipe=template.xml
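If you'd rather script that edit than open the XML by hand, something like this works (a sketch: it assumes the dumped template contains a literal <vfScanAllFlows> element with no XML namespace, which I haven't verified across pipeline versions):

```python
# Flip <vfScanAllFlows> from "tiOnly" to "false" in a dumped template.xml.
# Assumes the element name appears as-is (no namespaces) somewhere in the tree.
import xml.etree.ElementTree as ET

def set_vf_scan_all_flows(path, value="false"):
    tree = ET.parse(path)
    changed = 0
    for elem in tree.getroot().iter("vfScanAllFlows"):
        elem.text = value
        changed += 1
    tree.write(path)
    return changed   # number of elements updated; 0 means nothing matched
```

Returning the count of updated elements is a cheap safeguard: if it comes back 0, the template didn't contain the element where you expected and nothing was changed.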

Analysis #3: runAnalysisPipeAmplicons, vfScanAllFlows=false

OK, this is much more like what we wanted to see. The size distribution now mirrors much more closely the amplicon sizes we put into the original sequencing library (16S molecules vary in length depending on which species they came from), plus the primer-dimer crap.

What's going on, then? I tabulated the differences between the WGS base calling and the amplicon base calling.

Quality filter              WGS              Amplicons
doValleyFilterTrimBack      True             False ("must be false for amplicon experiments")
vfBadFlowThreshold          4                4
vfLastFlowToTest            320              320
vfMaxFailedPercent          100              100
vfTrimBackMinimumLength     84               84
vfTrimBackScaleFactor       1.1              2 (higher stringency)
vfScanLimit                 4096 (disabled)  700
vfScanAllFlows              ?                tiOnly

Base caller                 WGS              Amplicons
doQScoreTrim                True             False
errorQscoreWindowTrim       0.010
QScoreTrimNukeWindowSize    40
QScoreTrimBackScaleFactor   1.1
carryOverPeaks              False

It seems to me that vfScanAllFlows in "tiOnly" mode will throw out the read in its entirety if there are any bad flows at all (please correct me, readers, if I'm wrong). Clearly this is not desirable for 16S experiments. If it is set to "false", the other vf* parameters come into play. In WGS mode, reads are trimmed back when the valley filter is triggered (hence the large read-length variation in Analysis #2). In amplicon mode, reads are not trimmed back, but simply tested and then kept or discarded. With vfScanAllFlows=false, vfLastFlowToTest also comes into play.
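In pseudocode, my reading of the three behaviours is roughly this (an interpretation pieced together from the parameter table, not Roche's documented algorithm; the parameter names are real but the logic is my guess):

```python
# My interpretation of the valley-filter modes; corrections welcome.
def apply_valley_filter(flows, bad_flows, mode="tiOnly", trim_back=False,
                        vf_bad_flow_threshold=4):
    """flows: list of flow values; bad_flows: indices of low-intensity flows.
    Returns the (possibly trimmed) flows, or None if the read is discarded."""
    if mode == "tiOnly":
        # any bad flow at all discards the whole read
        return None if bad_flows else flows
    # mode == "false": the other vf* parameters come into play
    if len(bad_flows) < vf_bad_flow_threshold:
        return flows
    if trim_back:
        # WGS mode (doValleyFilterTrimBack=True): trim at the first valley,
        # hence the wide read-length spread in Analysis #2
        return flows[:bad_flows[0]]
    # amplicon mode: no trimming, just keep-or-discard
    return None
```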

A quick Google search on doValleyFilterTrimBack yielded this useful Roche document, which gives more information about the control of these parameters. Now, I guess I should have read this in the first place, but I was misled by the fact that amplicon processing used to "just work" and now seems to need tweaking (and produces such appalling default results). It seems the default settings are optimised for short amplicons, where the flow intensities don't drop off so much.

So the obvious next thing to try, should I want to improve the quality of my results, is to bump up the value of vfLastFlowToTest: at the moment, a read passes as long as flow 320 is good. Perhaps a value of 720 would be better.
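For rough orientation, here is my own back-of-envelope conversion from flow number to base position, using the common rule of thumb of somewhere around 1.6-2 flows per base for 454 data (the true ratio depends on the sequence being read):

```python
# Back-of-envelope conversion from flow number to approximate base position.
# FLOWS_PER_BASE is an assumed rule-of-thumb average, not a Roche constant.
FLOWS_PER_BASE = 1.7

for flow in (320, 720):
    print(f"flow {flow} ~ base {flow / FLOWS_PER_BASE:.0f}")
```

On that (very rough) assumption, testing only up to flow 320 vouches for little more than the first ~190 bases, while flow 720 reaches out towards full-length Titanium reads.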

So, anyway, I put this up here to help anyone else suffering from the same issue, and also to express my surprise that the Roche software has changed so radically.

What settings do you use for signal processing of Titanium 16S data? And why are the yields still quite poor (I want >400mb of usable data per run)?