Assessing Ion Torrent assembly quality with Mauve Assembly Metrics

15 Aug 2011

[gplusbutton]

I wanted to assess the quality of the assemblies that you might expect to get from assembling Ion Torrent reads de novo for bacterial genomes (see my last post for the initial results of assembly).

By good fortune, Aaron Darling of Mauve fame just published a manuscript in Bioinformatics describing Mauve Assembly Metrics.

For those not familiar with Mauve it is both a useful whole-genome alignment algorithm (progressiveMauve) and an attractive genome viewer optimised for n-way genome comparisons. It works pretty well for bacterial genome comparisons and should be in your toolkit if it isn't already.

One thing that is not immediately intuitive about Mauve is the way it handles repeats - in comparison to ACT/BLAST it won't align repeat sequences that can't be placed unambiguously. A typical example is the ribosomal RNA operon which often comes out as an ~8kb contig with much greater depth of coverage than the rest of the genome. Mauve won't be able to place this contig because it could go in multiple places in the genome. This will affect the results of the assembly metrics as well as making you think your genome is "missing" a piece that is in fact there.

Mauve Assembly Metrics takes your draft genome sequences (in contigs) and aligns it against a reference, reordering the contigs and producing reports in both tab-separated values format and accompanying PDF graphs. This is achieved through an R script.

The software comes with an example document which explains the output better than I could.

Some of the metrics offered by the software that I think may be interesting are:

Number of miscalls, i.e. SNPs/indels
Number of break-points, i.e. misassemblies within contigs
Number of genes in the reference which are broken, e.g. by a contig end
Gaps in the reference /extra bases
Missing sequence due to GC

It seems probable that the best results are going to be obtained when aligning against a close reference. And in theory, the test reads from E. coli K-12 DH10B should be very close to the reference genome. It is worth mentioning that the DH10B sequenced by Life Tech may not be exactly the same as the one sequenced by Blattner. We may expect SNPs to accumulate over time, e.g. from passage, and indeed this strain is notable for an increased rate of mutation when compared to K-12 MG1655. This is thought to be due to more frequent insertion sequence transpositions. This goes some way to explaining why "the E. coli genome" is not specific, nor is the "E. coli K-12 genome". In fact, can you ever sequence "the genome" of anything?

A caveat with Mauve's approach is that the system cannot readily detect the differences between alignment errors and sequencing errors. This is likely to be more of a problem when aligning against more divergent references (we were debating on Twitter whether it would work across different eukaryotic species).

So, let's look at some results.

[table id=5 /]

Interesting! A few things stand-out here. There are some expected trends here - for example, the number of gaps in the reference and the number of gaps in the assembly both reduce as we layer on extra coverage. Also the number of breakpoints/misassemblies as gauged by the "double-cut distance" also reduces.

One striking statistic is that even with the best assembly, 203 genes are broken. This is due to contig ends. This means that 5% of the genes in the genome are not enclosed by individual contigs. The worst assembly (at 10x) has almost 15% of coding sequences disrupted by a contig break. This is going to make life a bit difficult when trying to confidently annotate a genome.

Miscalls

Miscalls I think refer to SNPs and short indels.

The number of apparent miscalls goes down as we layer on extra coverage to 20x, but rather curiously we then see an increase in number of miscalls going from 25x to 25x. The 314-long dataset has the most miscalls by far.

Let's try and figure out what's happening here. A useful chart maps this across the genome:

A few things stand out here. Firstly that you can see that frequency of miscalls in the 10x genome and the 314-long set are significantly higher than the other datasets and this is across the genome.

However, the high coverage datasets have a sharp peak of miscalls around 800kb into the genome. This is the source of the majority of those SNPs and the reason we get an uptrend with increased coverage.

Looking at this alignment in Mauve it is a very low nucleotide identity alignment - and BLASTing the contig back against the reference shows in fact there is a much better place to map the contig, giving 100% identity. So these are false SNP calls. I am not sure why it didn't get picked up by Mauve, perhaps tweaking the parameters would fix it.

I decided to strip this region out (by excluding SNPs occurring >5 times in a sliding window 1,000 bp long, shifted by 100bp). I also discarded SNPs near a contig boundary (<100bp) as this is another source of misalignments.

Then you get the following result:

That looks a bit more rational. A quick look at the assembly for the 25x coverage genome doesn't immediately indicate whether these are genuine errors or simply differences from the reference. They are not obviously associated with long homopolymeric tracts and there is often a good, deep consensus. Something for another blog post.

One other thing that is interesting in this plot:

There is aregion where there are no miscalled bases in the genome, between 500-750kb in the reference. This corresponds to a large 120kb tandem duplication in the DH10B genome. Because we only have a single consensus contig for this repeat Mauve cannot place it correctly and that region ends up being unaligned. It's also the major contribution to the surprising finding that almost 350kb of the genome is not covered, equating to 8%. The truth is it is covered, but not ambiguously.

So this is an example where misalignment can mess your results up.

GC Biases

One more plot, before I sign off.

Not sure why the background GC plot is so "wiggly". But let's assume its a normal distribution.

It seems missing regions are not generally related to GC content, except in the 314-long dataset. I wonder if this reflects some change to the chemistry to coax out longer reads, e.g. the emPCR reaction conditions?

Conclusions

Mauve Assembly Metrics gives a useful guide to assembly quality, however misalignments may skew results
The test Ion Torrent 316 data reaches a decent consensus sequence around 20x coverage
Long repeats are always an issue for mapping de novo assemblies
The 314-long dataset has plenty of miscalls, further evidence this is not a high-quality data set

Darling, A., Tritt, A., Eisen, J., & Facciotti, M. (2011). Mauve Assembly Metrics Bioinformatics DOI: 10.1093/bioinformatics/btr451

Loman Labs

Assessing Ion Torrent assembly quality with Mauve Assembly Metrics

Related Posts

Balti and Bioinformatics: 14th November 2019 14 Oct 2019

Food in Birmingham 11 Aug 2019

How to generate consensus sequences using nanopolish 21 Dec 2018