Assembling Illumina and 454 data

16 Aug 2011

This is a question that keeps cropping up on Seqanswers and Biostar.

Amazingly, there is still no 100% satisfactory pipeline for assembling combined Illumina and 454 data de novo.

Here are the ways I know about:

1) Assemble 454 data on its own and correct with Illumina data

For example, assemble the 454 data with Newbler, then map the Illumina reads back to the resulting contigs and correct the consensus with a mapping pipeline like Nesoni.

Advantages:

Newbler still works best on 454 data
Newbler scaffolder works pretty well with 454 PE data
Corrects homopolymers/indel errors well
Quite quick
Newbler 2.6 has a handy gap filling mode (-scaffold on command line)

Disadvantages:

Extra Illumina coverage won't aid assembly contiguity (important if the 454 data is low-coverage)
Won't correct structural misassemblies in 454 assembly (although it may detect them)

2) Perform a hybrid assembly with MIRA

Advantages:

Gives very reliable output
Natively supports 454 and Illumina data at overlap stage
Can view assembly in GAP4 and see 454 and Illumina reads, and quickly find problems

Disadvantages:

Quite slow
Memory hungry with lots of Illumina reads
Will not scaffold using paired-end 454 data or mate-pair Illumina data; you need to do this separately with BAMBUS, SSPACE or another scaffolder

3) Perform a hybrid assembly with CLC Genomics Workbench

Advantages:

Very quick
Native support of SFF and FASTQ formats

Disadvantages:

Closed source, closed methods - hard to know what it is doing
Not many user-configurable parameters
Does not support paired-end 454 data or mate-pair Illumina data to produce scaffolds

4) Perform a hybrid assembly with ... Seqman Ngen/RAY/Celera Assembler/other

Included for completeness; I have not spent much time with these packages.

5) Assemble Illumina data and 454 data separately and combine with MINIMUS

Advantages:

Reasonably quick
Can use "best" assembler for each flavour of data
Theoretically provides independent confirmation of each assembly

Disadvantages:

When there are disagreements, which assembly is correct?
Coverage not additive so unlikely to result in improved contiguity
Can propagate misassemblies in either assembly
Difficult to use with gapped scaffolds

6) Fake Sanger reads from 454 or Illumina assembly and feed to the other assembler

I really don't like this approach as so much useful information is lost in the resulting assembly, so I haven't tried it.
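
For readers unfamiliar with the idea, here is a minimal sketch of what "faking" reads usually means: shredding each contig into overlapping fragments of Sanger-like length. The function name, the 800 bp read length and the 400 bp step are illustrative assumptions of mine, not values taken from any particular tool.

```python
def shred_contig(contig, read_len=800, step=400):
    """Cut one contig into overlapping pseudo-reads.

    read_len and step are illustrative defaults (assumptions),
    chosen only to resemble Sanger-length reads at ~2x tiling.
    """
    if len(contig) <= read_len:
        # Short contigs are passed through as a single pseudo-read.
        return [contig]
    starts = list(range(0, len(contig) - read_len + 1, step))
    # Add a final window so the end of the contig is always covered.
    last = len(contig) - read_len
    if starts[-1] != last:
        starts.append(last)
    return [contig[s:s + read_len] for s in starts]


# Toy usage: a 2,100 bp contig becomes five 800 bp pseudo-reads.
pseudo_reads = shred_contig("ACGT" * 525)
print(len(pseudo_reads), len(pseudo_reads[0]))
```

Note that the pseudo-reads are plain sequence: the per-base qualities and read depth behind the original consensus do not travel with them.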

7) Local assembly of abundant paired-end data to fill 454 scaffolds

This is a useful complementary approach to the ones above: you can use BGI's GapCloser or IMAGE to try to fill gaps in scaffolds by local assembly of abundant Illumina paired-end data.

Update: 8) Newbler 2.6, incorporating FASTQ files

I can't believe I forgot this, thanks to Anthony Underwood for reminding me.

Newbler 2.6 will now accept FASTQ files and so this may be a good option. I am going to have a play around with it and will post back my findings.

Conclusion

I still think it's surprising there is no definitive assembly solution that can use 454 and Illumina data of all flavours and produce reliable, error-corrected scaffolds. Please correct me if I'm wrong! Similar issues may apply to combining Illumina or SOLiD data with Ion Torrent, PacBio, etc.

Comments, corrections, feedback as always appreciated.

Postscript:

One issue here is that historically the two data types have been assembled with fundamentally different approaches: 454 assemblers use overlap-layout-consensus, while Illumina assemblers use de Bruijn graphs. However, with the advent of longer (100-150 bp) and more accurate Illumina reads, particularly if you use a k-mer error correction approach, overlap-layout-consensus may become an option for Illumina data too. Jared Simpson is experimenting with string graphs as an alternative to de Bruijn graphs.
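
To make the de Bruijn side of that distinction concrete, here is a toy sketch: reads are broken into k-mers, each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is read off by walking the graph. The function names, the toy reads and k=4 are my own assumptions for illustration. Real assemblers add error handling, reverse complements and repeat resolution, none of which is shown here.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor.

    Simplification: only out-degree is checked, and the walk stops at a
    branch, a dead end, or a node it has already visited.
    """
    contig = start
    node = start
    seen = {start}
    while node in graph and len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from the sequence ACGTTGCATCGG.
reads = ["ACGTTGCA", "GTTGCATC", "TGCATCGG"]
graph = de_bruijn_edges(reads, k=4)
contig = walk_unambiguous(graph, "ACG")
print(contig)  # ACGTTGCATCGG
```

Notice that the reads themselves are never compared with one another; an overlap-layout-consensus assembler would instead compute pairwise overlaps between whole reads, which is why read length and accuracy matter so much for that approach.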

Loman Labs
