Sequencing data: I want the truth! (You can't handle the truth!)

Two sequencing papers caught my eye this week.

This letter from Piskol and Li is perhaps the final nail in the coffin for the heavily criticised and debunked (also see: GenomesUnzipped) RNA editing paper from Li and Cheung published in Science in early 2011 (as Thomas Keane said on Twitter: 'I can't believe people are still debating this!').

The letter from Piskol and Li examined the claim of "non-canonical" RNA editing, i.e. post-transcriptional editing differing from the two known types, adenosine-to-inosine (A-to-I; I read as G) and the rare cytosine-to-uracil (C-to-U). Although a vast swathe of the claimed editing events had been debunked by previous studies, they examined 11 putative events which had apparently been validated by sequencing PCR amplicons on capillary instruments. What they found should be disturbing to sequence bioinformaticians:

They noticed that when each of these amplicon sequences is searched against the reference human genome using BLAT, each one has a very similar 'second-best' hit elsewhere in the genome. And lo, if you examine the sequence of those second-best hits, the variant pointing to RNA editing isn't present. They then designed primers to specifically amplify the region of the genome around the second-best hit and demonstrated that this, and not the region associated with the best hit that originally hinted at RNA editing, was in fact the likely template for the original sequencing read. Put simply, the RNA editing event wasn't an RNA editing event at all.
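To make that check concrete, here is a minimal sketch, assuming you have BLAT's tab-separated PSL output (no header lines) for your amplicons against the reference genome: it flags any amplicon whose second-best genomic hit scores nearly as well as its best hit. The file handling, the crude 'matches minus mismatches' score and the 90% threshold are all illustrative, not what Piskol and Li actually used.

```python
#!/usr/bin/env python
"""Flag amplicons whose second-best BLAT hit is nearly as good as the best.

A minimal sketch: assumes headerless, tab-separated PSL output from BLAT,
with the amplicon sequences as queries. The score used here (matches minus
mismatches) is a simplification of BLAT's own scoring.
"""
import sys
from collections import defaultdict

def psl_score(fields):
    # PSL columns (0-based): 0 = matches, 1 = misMatches
    return int(fields[0]) - int(fields[1])

def best_two_hits(psl_path):
    hits = defaultdict(list)          # amplicon name -> list of (score, target, start, end)
    with open(psl_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 17:
                continue              # skip malformed or stray header lines
            qname, tname = fields[9], fields[13]
            tstart, tend = fields[15], fields[16]
            hits[qname].append((psl_score(fields), tname, tstart, tend))
    for qname, scored in hits.items():
        scored.sort(reverse=True)
        yield qname, scored[:2]

if __name__ == "__main__":
    for amplicon, top in best_two_hits(sys.argv[1]):
        if len(top) < 2:
            continue
        best, second = top
        if second[0] >= 0.9 * best[0]:   # arbitrary threshold: second hit within 90% of best
            print(f"{amplicon}: second-best hit {second[1]}:{second[2]}-{second[3]} "
                  f"scores {second[0]} vs {best[0]} -- check for a paralogous template")
```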

If you've done much sequence bioinformatics and variation detection you will know that alignment to paralogous regions of the genome (repeats) is a major reason for false positive SNP calls (perhaps the number one reason?). I see this frequently in the microbial genome projects I am involved in. As an aside, I bet this kind of analysis error happens all the time in published papers, but relates to findings not significant enough to attract extensive scrutiny; discovering novel types of RNA editing would be a pretty big prize, and in this case it was deemed worthy of a Science paper. What is notable is that Sanger 'validation' also has the capacity to mislead if primers are not designed against unique regions of the genome.
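One simple defence, sketched below on the assumption that you have a coordinate-sorted, indexed BAM alongside your candidate SNPs, is to flag calls where a large fraction of the covering reads map ambiguously (MAPQ 0), a typical signature of paralogous or repeat-derived alignment. The pysam calls are standard, but the file name, positions and 20% cut-off are invented for illustration.

```python
import pysam  # assumes a coordinate-sorted, indexed BAM next to your variant calls

def ambiguous_fraction(bam, chrom, pos, min_mapq=1):
    """Fraction of reads covering a 1-based position that map ambiguously (MAPQ 0)."""
    total = ambiguous = 0
    for read in bam.fetch(chrom, pos - 1, pos):
        if read.is_unmapped or read.is_duplicate:
            continue
        total += 1
        if read.mapping_quality < min_mapq:
            ambiguous += 1
    return ambiguous / total if total else 0.0

bam = pysam.AlignmentFile("sample.bam", "rb")   # illustrative file name
for chrom, pos in [("chr1", 1234567)]:          # positions of your candidate SNPs
    frac = ambiguous_fraction(bam, chrom, pos)
    if frac > 0.2:                               # arbitrary cut-off
        print(f"{chrom}:{pos} -- {frac:.0%} of covering reads are multi-mappers; "
              "treat this call with suspicion")
```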

That finding reminded me of an email I'd sent to Titus Brown a few months ago, after he'd asked me to do some pre-publication peer review of a manuscript he'd written on possible sequencing artifacts causing problems with metagenomic assembly. I sent him a list of potential reasons for artifacts that may or may not explain his results, which I have reproduced and augmented here:

Library preparation errors:
- PCR amplification point mutations (e.g. TruSeq protocol, amplicons) [1]
- emPCR amplification point mutations (454, Ion Torrent and SOLiD)
- Bridge amplification errors (Illumina)
- Chimera generation (particularly during amplicon protocols) [1]
- Sample contamination
- Amplification errors associated with high or low GC content
- PCR duplicates

Sequencing errors:
- Base miscalls due to low signal
- Indel errors (particularly PacBio)
- Base under- and over-calls with flow-based chemistries, associated with long homopolymers (454, Ion Torrent) [2]
- Short homopolymer-associated indels (Ion Torrent PGM) [2]
- Post-homopolymeric tract SNPs (Illumina) and/or read-through problems [3]
- Errors associated with inverted repeats (Illumina) [4]
- Errors at specific motifs, particularly with older Illumina chemistry [4]

Analysis errors:
- Calling variants without sufficient reads mapping
- Bad mapping (incorrectly placed read)
- Correctly placed read but indels misaligned
- Multi-mapping to repeat/paralogous regions
- Sequence contamination, e.g. adaptors
- Error in reference sequence
- Alignment to ends of contigs in draft assemblies
- Incorrect trimming of reads, aligning adaptors
- Inclusion of PCR duplicates

Phew! Are you sure you want to do some genome sequencing?!

I've included a few references to relevant papers here. Casey Bergman has started a CiteULike collection of papers relating to sequencer error profiles.

Now, thanks to a second paper published this week we have another item for the table (BTW please comment on my table and let me know what I've missed). This is a technical tour de force from the Broad Institute (ht @dgmacarthur) published in Nucleic Acids Research. Allow me to summarise:

Whilst searching for variants in cancer samples they discovered artifacts of the pattern "C>A/G>T" in a particular triplet context, occurring at low frequency in some cancer projects. Low frequency variants are of course of great interest in cancer genetics, as the sample is genetically heterogeneous, and any of these low frequency variants may be potential "drivers" of cancer progression which over time could become dominant. They may also represent clues to pathways which could be targeted with specific drugs.

However, these artifacts did not appear to be real variants because of certain patterns spotted in the analysis: specifically, strand bias (significantly different patterns of forward/reverse read orientation between the reads supporting the variant calls and those that do not) and their presence in both tumour and normal samples.
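Strand bias of this kind is usually assessed with a Fisher's exact test on a 2×2 table of reference/variant read counts split by strand; the sketch below, with invented counts, shows the idea (a generic check, not necessarily the exact test the authors used).

```python
from scipy.stats import fisher_exact

# 2x2 contingency table of read counts at a candidate variant site:
#                    forward   reverse
# reference allele   ref_fwd   ref_rev
# variant allele     alt_fwd   alt_rev
ref_fwd, ref_rev = 120, 115      # made-up counts for illustration
alt_fwd, alt_rev = 14, 1         # variant support almost entirely on one strand

odds_ratio, p_value = fisher_exact([[ref_fwd, ref_rev], [alt_fwd, alt_rev]])
print(f"Fisher's exact p = {p_value:.3g}")
if p_value < 0.01:
    print("Variant-supporting reads show significant strand bias -- likely artifact")
```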

The impressive part of this study is that they then managed to track down the cause. Unlike the usual suspects in such cases (errors in PCR amplification, sequencing errors and alignment/analysis errors), they demonstrated that oxidation of DNA during library preparation, in this case during acoustic shearing, generated 8-oxoguanine 'lesions' in the sample DNA, which were responsible for these errors.

To confirm these were not sequencing errors, they showed that the artifact was present with HiSeq v2, v3 and MiSeq chemistries as well as on the Ion Torrent PGM.

They developed a metric called "ArtQ", a Phred-like quality score reflecting the probability that an observed base change is due to this artifact:

ArtQ = -10 × log10( (consistent errors - inconsistent errors) / all observations )

They considered an ArtQ score of >30 to mean the sample is unaffected by this problem. They then go on to suggest an alternative library preparation, with the inclusion of anti-oxidants, in order to improve the ArtQ score, but they also suggest a bioinformatics-based filter to exclude such mutations when this is not possible. Go read the rest of the paper, it's impressive stuff (despite the presence of 3-D bar charts, yuck!).
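Reading that formula with the grouping I assume is intended (the excess of artifact-consistent over artifact-inconsistent errors, as a fraction of all observations), a toy calculation looks like the sketch below; the counts are invented purely to show how the score behaves around the >30 threshold.

```python
import math

def artq(consistent_errors, inconsistent_errors, all_observations):
    """Phred-style ArtQ score, assuming the grouping
    -10 * log10((consistent - inconsistent) / all_observations)."""
    excess = consistent_errors - inconsistent_errors
    if excess <= 0:
        return float("inf")          # no excess of artifact-consistent errors
    return -10 * math.log10(excess / all_observations)

# Invented counts: 600 artifact-consistent errors, 150 inconsistent, 1,000,000 bases observed
print(artq(600, 150, 1_000_000))     # ~33.5, above the >30 'unaffected' threshold
```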

The conclusions of the paper are the ones I want to focus on. They state (emphasis mine):

The obvious deleterious effects that the existence of such artifacts can have on the field of cancer research could be dramatic. If multiple common processes in the laboratory can significantly alter the physical base sequence of DNA, it begs the question of whether we can truly be confident that the rare mutations we are searching for can actually be attributed to true biological variation

They then warn that this may not be the only undiscovered artifact out there:

this is one of the myriad of possible low frequency errors that could be induced during NGS sample preparation

They conclude that:

A systematic review of a wide variety of data obtained using different protocols from different laboratories needs to be undertaken by the sequencing community to identify whether there are any types of other artifacts that may be induced during extraction and/or library preparation that could be wrongly attributed to the biology of a given disease.

I couldn't agree more. Lex Nederbragt and I are working on a project we are calling SeqBench, which we hope will start to address this problem by producing a well-curated metadatabase of sequencing reads. By collecting high quality metadata we hope to be able to provide a useful testing resource which could be used to compare the results of different library preparation techniques, as well as the results from different sequencing platforms, aligners, assemblers and more. I am presenting a poster on this project at AGBT and plan to post more on the blog in the run-up to that meeting. I'd be delighted if this was something you would like to get involved with.

This is a draft blog post. I reserve the right to make changes to it until I remove this disclaimer, probably later on today. If you make useful comments or suggestions via the comments form or Twitter I'll happily change the post and give you a credit.

Thanks to Casey Bergman for proofreading and useful suggestions.

References

[1] http://pathogenomics.bham.ac.uk/blog/2010/08/come-on-feel-the-pyronoise/

[2] http://pathogenomics.bham.ac.uk/blog/2012/05/benchtop-sequencer-comparison-paper/

[3] Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012 Jul 24;13:341. doi: 10.1186/1471-2164-13-341. PubMed PMID: 22827831; PubMed Central PMCID: PMC3431227.

[4] Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011 Jul;39(13):e90. doi: 10.1093/nar/gkr344. Epub 2011 May 16. PubMed PMID: 21576222; PubMed Central PMCID: PMC3141275.