Come on feel the PyroNoise

454 sequencing technology has revolutionised the field of microbial ecology by providing a means to sequence tens of thousands of partial 16S rDNA sequences quickly and efficiently. However, this new capacity brought new problems to a field fraught with potential sources of bias. Early analyses of microbial communities using 454 data tended to overstate the number of OTUs in a given sample, leading to the realisation that a thorough understanding of the sequencing error model was crucial to reliable analysis.

One of the people instrumental in bringing some order to the chaos is Chris Quince at the University of Glasgow, who has tackled several important sources of bias with his AmpliconNoise (previously PyroNoise) software. Chris published an important Nature Methods paper on this software in 2009 and has recently been awarded several new grants to carry on his good work - check out his group website for details.

So it was with great interest that I attended a workshop on 16S analysis and PyroNoise, hosted at the University of Newcastle several months back. I wanted to write up a few notes on this workshop in the hope they will be useful to the wider community.

One of the most useful sessions was Chris’ description of PyroNoise. He defined what he regards as noise and how you can try to stop this noise corrupting your analysis. The key sources of noise are:

- errors introduced by the 454 pyrosequencing process itself
- errors introduced during PCR amplification
- chimeric sequences formed during PCR

Before we go any further: of course, 16S analysis is not purely a technical exercise in eliminating noise. 16S analysis requires a decent experimental design, and without that no amount of de-noising will save your analysis. There are many more sources of bias, including sampling bias (did we sample enough communities, and were they representative?), sample preparation bias (did all the cells lyse and release their genomic contents equally, and are we counting dead cells as well as live ones?), amplification bias (so-called ‘universal’ primers are probably anything but) and sequencing bias (e.g. high-GC regions may not sequence well).

And to make things even more complicated, the very act of filtering and de-noising your sample can itself add bias. For example, a de-noising procedure that filters out reads containing long homopolymers may also discard genuine sequences, reducing the number of true OTUs recovered from your sample.
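To make that concrete, here is a hypothetical filter of this kind - the 8-base threshold and the function name are my own illustration, not part of any real pipeline:

```python
import re

# Any run of 8 or more identical bases (threshold chosen for illustration)
HOMOPOLYMER = re.compile(r"A{8,}|C{8,}|G{8,}|T{8,}")

def drop_long_homopolymers(reads):
    """Discard reads containing a long homopolymer run.

    Note the bias this introduces: taxa whose 16S sequence genuinely
    contains such runs will be under-represented or lost entirely.
    """
    return [read for read in reads if not HOMOPOLYMER.search(read)]
```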

But back to the de-noising process. AmpliconNoise is divided into three distinct steps. The first step is performed on the flowgram data, and the idea here is to remove noise introduced by 454 sequencing. A practical consideration is that you need the SFF files from the 454 run; the FASTA+QUAL files will not be sufficient, as the software works in flow-space. It works by calculating pairwise distances using flow signal intensities. All-vs-all pairwise comparisons are performed, the sequences are clustered, and the “true” sequences in the sample are determined by an expectation-maximisation (EM) algorithm. In simple terms, the idea is that if we know all the sequences and their relative frequencies, we can use a Bayesian approach to estimate the probability that any given read was generated by a given sequence.
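To give a flavour of the idea - and only that; this is a toy sketch, not AmpliconNoise’s actual model, and the exponential likelihood and the sigma parameter are my own simplifications - here is what an EM loop over a precomputed read-vs-candidate distance matrix might look like:

```python
import numpy as np

def em_denoise(dist, n_iter=50, sigma=0.1):
    """Toy EM over a read-vs-candidate distance matrix.

    dist[i, j] = flowgram distance between read i and candidate sequence j.
    Returns tau (estimated relative frequency of each candidate) and z
    (posterior probability that each read was generated by each candidate).
    """
    n_reads, n_seqs = dist.shape
    tau = np.full(n_seqs, 1.0 / n_seqs)  # start from uniform frequencies
    for _ in range(n_iter):
        # E-step: likelihood of read i under candidate j, weighted by tau
        weighted = np.exp(-dist / sigma) * tau
        z = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate candidate frequencies from the soft assignments
        tau = z.sum(axis=0) / n_reads
    return tau, z
```

Each read is then denoised by assigning it to the candidate sequence carrying most of its posterior mass.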

Chris has been smart here by trialling this on real 454 data rather than simulated data, but crucially he used mock populations generated by mixing genomic DNA (rather than cells). On a mixture containing genomic DNA from 90 separate taxa, he demonstrated that a naive OTU prediction approach generates about 1,000 OTUs when OTUs are defined as any change in sequence (i.e. a 0% cut-off).

After applying the initial de-noising procedure in flow-space this drops by about half - an improvement, but still a huge overestimate.

The rest of the noise is independent of the noise generated by pyrosequencing. Chris proposed that this noise comes from PCR amplification. However, a number of people at the workshop pointed out that rRNA genes are multi-copy in most species and heterogeneous in many. This is also a potential source of OTU overestimation, and despite the availability of the rDNA copy number database, I am not sure that people account for this routinely when looking at relative taxonomic abundance.
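For what it’s worth, correcting for copy number is simple arithmetic once you have the numbers. A back-of-the-envelope sketch (the read counts are invented; the copy numbers are approximate literature values, e.g. E. coli carries seven rrn operons):

```python
# Approximate rRNA operon copy numbers (literature values)
copy_number = {"E. coli": 7, "B. subtilis": 10, "M. tuberculosis": 1}
# Invented raw 16S read counts, for illustration only
raw_counts = {"E. coli": 700, "B. subtilis": 500, "M. tuberculosis": 50}

# Divide read counts by copy number, then renormalise to get
# copy-number-corrected relative abundances
corrected = {t: raw_counts[t] / copy_number[t] for t in raw_counts}
total = sum(corrected.values())
relative = {t: round(n / total, 3) for t, n in corrected.items()}
print(relative)  # {'E. coli': 0.5, 'B. subtilis': 0.25, 'M. tuberculosis': 0.25}
```

Note how the raw counts suggest E. coli dominates by a wide margin, while the corrected abundances tell a rather different story.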

The second de-noising step of PyroNoise works in sequence space and is the most expensive part of the algorithm in terms of running time. Briefly, the sequences are clustered using a distance that reflects the probability of a sequence being generated by PCR errors given a true sequence T. To account for indels, a Needleman-Wunsch alignment is performed for each pair, which is why this step takes so long - typically requiring a small cluster running for several days to analyse a whole run of 454 data.
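For readers unfamiliar with it, Needleman-Wunsch is just dynamic programming over a global alignment. A minimal cost-based sketch is below (the penalties are illustrative, not AmpliconNoise’s actual parameters); the expense comes from doing this for every pair of sequences - O(N^2) alignments, each quadratic in read length:

```python
def needleman_wunsch_distance(a, b, mismatch=1.0, gap=1.0):
    """Global alignment distance between sequences a and b.

    Classic dynamic programming: dp[i][j] is the minimal cost of
    aligning the first i bases of a with the first j bases of b.
    Gap moves are what let us account for the indels common in 454 data.
    """
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = min(sub, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]
```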

This sequence clustering step improves the estimate of OTUs dramatically, nearly down to the expected 90 taxa in the original mix.

The final source of noise is chimeric sequences, and Chris has developed an algorithm he calls Perseus to deal with them. Chris’ solution is ingenious; I am not sure he has published it yet, so I won’t go into details. There is also a module in Mothur called chimera.slayer which operates on a similar principle.

After these three stages are complete, the predicted number of OTUs is very close to 90, the actual number of taxa in the mix. But are these OTUs the right ones when classified against a taxonomy? It turns out they are, but only when an OTU cut-off of 1.5-3% is used.
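As an aside, the cut-off itself is just a threshold on pairwise distance during clustering. Here is a minimal single-linkage sketch (real pipelines such as Mothur offer several linkage methods; the dist function is assumed to return a fractional distance, e.g. derived from alignments like the one above):

```python
def single_linkage_otus(names, dist, cutoff=0.03):
    """Single-linkage OTU clustering via union-find.

    Any pair of sequences at or below the cutoff (e.g. 0.03 for the
    conventional 97%-identity OTU) ends up in the same cluster.
    """
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist(a, b) <= cutoff:
                parent[find(a)] = find(b)  # merge the two clusters

    otus = {}
    for n in names:
        otus.setdefault(find(n), []).append(n)
    return list(otus.values())
```

Raising the cut-off merges more sequences into each OTU, which is why moving from 0% to 1.5-3% collapses the residual noise into the correct taxa.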

There’s an awful lot more to say about 16S analysis and I, like many others, am still learning my way around this exciting field. But there is no doubt that to perform 16S analysis successfully, every step of the process needs to be thought about in some detail. Blindly trusting the results of an automated 16S pipeline is a sure-fire way of getting burnt. On that note, I plan to blog about what I learnt about the two most popular 16S analysis pipelines, Mothur and QIIME, in a subsequent article.

References

Quince, C., Lanzén, A., Curtis, T., Davenport, R., Hall, N., Head, I., Read, L., & Sloan, W. (2009). Accurate determination of microbial diversity from 454 pyrosequencing data. Nature Methods, 6(9), 639-641. DOI: 10.1038/nmeth.1361