What are the potential sources of errors in sequencing datasets?

31 Oct 2012

From an e-mail discussion with C. Titus Brown.

I could think of the following:

PCR amplification errors (in amplicon sequencing, and in effectively all Illumina sequencing, as there are usually 5-10 cycles of PCR performed before cluster generation):
- Substitutions (which show up as false SNPs)
- PCR chimeras (very likely to be a problem in a metagenomic dataset)
- Indels (sometimes strand-specific)
(Illumina) Cluster generation errors - probably similar to PCR as this is a PCR-like stage
(454/PGM) emPCR errors - again similar to PCR
(All platforms) Random sequencing errors
Systematic sequencing errors: on Illumina, seen in the past at certain GC-rich motifs and inverted repeats, and as SNPs downstream of homopolymers; on 454/PGM, at homopolymers
Adaptor sequencing
Post-adaptor read through
Sample contamination
Genuine biological variation (not sequencing error, but could be confused)
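
Two of the items above (homopolymers and adaptor read-through) are easy to screen for directly in the reads. Below is a minimal Python sketch, not a replacement for a proper trimming tool. The function names, the `min_len` and `min_overlap` thresholds, and the `ADAPTOR` string are all assumptions for illustration; substitute the adaptor sequence for your own library prep. It only handles exact matches, so an adaptor that itself contains a sequencing error will be missed.

```python
import re

# Illustrative adaptor prefix (an assumption); use the one from your library prep.
ADAPTOR = "AGATCGGAAGAGC"


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of a single base >= min_len."""
    # ([ACGT]) captures one base; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(seq)]


def trim_adaptor(read, adaptor=ADAPTOR, min_overlap=5):
    """Trim an exact adaptor match, or a partial adaptor at the 3' end of the read."""
    # Case 1: the full adaptor is present, so everything from it onwards
    # (including any post-adaptor read-through) is discarded.
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends part-way into the adaptor; try the longest prefix first.
    for k in range(min(len(adaptor), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read


if __name__ == "__main__":
    print(homopolymer_runs("ACGTTTTTGAAAAC"))
    print(trim_adaptor("ACGTACGT" + ADAPTOR + "TTTT"))
```

Flagging variant calls that fall inside or just downstream of a reported homopolymer run is one cheap way to separate likely systematic errors from genuine variation, though it is only a heuristic.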

Others??

Loman Labs