Assembling Illumina and 454 data
16 Aug 2011
This is a question that keeps cropping up on Seqanswers and Biostar.
Amazingly there is still no 100% satisfactory pipeline for assembling combined Illumina and 454 data de novo.
Here are the ways I know about:
1) Assemble 454 data on its own and correct with Illumina data
For example, assemble the 454 data with Newbler, then map the Illumina reads back to the resulting contigs and correct the consensus with a mapping pipeline like Nesoni.
Advantages:
- Newbler still works best on 454 data
- Newbler scaffolder works pretty well with 454 PE data
- Corrects homopolymers/indel errors well
- Quite quick
- Newbler 2.6 has a handy gap filling mode (-scaffold on command line)
Disadvantages:
- Extra Illumina coverage won't aid assembly contiguity (important if the 454 coverage is low)
- Won't correct structural misassemblies in 454 assembly (although it may detect them)
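To make the "correct with Illumina data" step concrete, here is a minimal sketch of the final stage only: applying a list of corrections to a 454 contig. It assumes the mapping and variant calling have already been done by some pipeline; the function name `apply_corrections` and the `(position, ref, alt)` tuple format are hypothetical and are not Nesoni's actual output format.

```python
def apply_corrections(contig, corrections):
    """Apply (pos, ref, alt) edits to a contig sequence.

    pos is 0-based; ref is the text currently in the contig at pos,
    alt is what the Illumina evidence says it should be.
    Edits are applied right-to-left so earlier coordinates stay valid.
    Overlapping edits are not handled in this sketch.
    """
    seq = contig
    for pos, ref, alt in sorted(corrections, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"contig does not match ref {ref!r} at {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


# Toy contig with a typical 454 homopolymer undercall (4 Ts instead of 5)
# and one substitution error further along.
contig_454 = "GATTTTCAGGC"
corrections = [
    (1, "A", "AT"),  # insertion: restore the missing T in the homopolymer
    (9, "G", "T"),   # substitution
]
corrected = apply_corrections(contig_454, corrections)
print(corrected)
```

Note that this only edits bases within existing contigs, which is why this approach cannot join contigs or repair structural misassemblies.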
2) Perform a hybrid assembly with MIRA
Advantages:
- Gives very reliable output
- Natively supports 454 and Illumina data at overlap stage
- Can view assembly in GAP4 and see 454 and Illumina reads, and quickly find problems
Disadvantages:
- Quite slow
- Memory hungry with lots of Illumina reads
- Will not scaffold using paired-end 454 data or mate-pair Illumina data; scaffolding has to be done separately with BAMBUS, SSPACE or a similar tool
3) Perform a hybrid assembly with CLC Genomics Workbench
Advantages:
- Very quick
- Native support of SFF and FASTQ formats
Disadvantages:
- Closed source, closed methods - hard to know what it is doing
- Not many user-configurable parameters
- Does not support paired-end 454 data or mate-pair Illumina data to produce scaffolds
4) Perform a hybrid assembly with ... Seqman Ngen/RAY/Celera Assembler/other
Included for completeness; I have not spent much time with these packages.
5) Assemble Illumina data and 454 data separately and combine with MINIMUS
Advantages:
- Reasonably quick
- Can use "best" assembler for each flavour of data
- Theoretically provides independent confirmation of each assembly
Disadvantages:
- When there are disagreements, which assembly is correct?
- Coverage not additive so unlikely to result in improved contiguity
- Can propagate misassemblies in either assembly
- Difficult to use with gapped scaffolds
6) Fake Sanger reads from 454 or Illumina assembly and feed to the other assembler
I really don't like this approach as so much useful information is lost in the resulting assembly, so I haven't tried it.
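For readers who have not seen it, the "fake Sanger reads" trick usually means shredding contigs into overlapping tiles. The sketch below is an illustration under my own assumptions: the function name `shred_contig` and the 800bp read length / 400bp step are hypothetical choices, not values taken from any particular tool.

```python
def shred_contig(name, seq, read_len=800, step=400):
    """Yield (read_name, sequence) tiles that cover seq with overlaps."""
    if len(seq) <= read_len:
        # Short contig: emit it whole as a single pseudo-read.
        yield f"{name}_0", seq
        return
    starts = list(range(0, len(seq) - read_len + 1, step))
    # Add a final tile if the regular tiling stops short of the contig end.
    if starts[-1] + read_len < len(seq):
        starts.append(len(seq) - read_len)
    for start in starts:
        yield f"{name}_{start}", seq[start:start + read_len]


contig = "ACGT" * 500  # toy 2000bp contig
fake_reads = list(shred_contig("contig1", contig))
for read_name, read_seq in fake_reads:
    print(read_name, len(read_seq))
```

The output makes the information loss obvious: every pseudo-read is just a substring of the consensus, so the original read depth, base qualities and pairing information are all gone by the time the second assembler sees the data.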
7) Local assembly of abundant paired-end data to fill 454 scaffolds
This is a useful complementary approach to the ones above: you can use BGI's GapCloser or IMAGE to try to fill gaps in scaffolds by locally assembling abundant Illumina paired-end data.
Update: 8) Newbler 2.6, incorporating FASTQ files
I can't believe I forgot this, thanks to Anthony Underwood for reminding me.
Newbler 2.6 will now accept FASTQ files and so this may be a good option. I am going to have a play around with it and will post back my findings.
Conclusion
I still think it's surprising there is no definitive assembly solution that can use 454 and Illumina data of all flavours and produce reliable, error-corrected scaffolds. Please correct me if I'm wrong! Similar issues may apply to combining Illumina or SOLiD data with Ion Torrent, PacBio, etc.
Comments, corrections, feedback as always appreciated.
Postscript:
One issue here is that historically you use fundamentally different approaches for 454 data and Illumina data: 454 assemblers use overlap-layout-consensus, while Illumina assemblers use de Bruijn graphs. However, with the advent of longer Illumina reads (100-150bp) and greater accuracy (particularly if you use a k-mer error correction approach), overlap-layout-consensus may become an option for Illumina data too. Jared Simpson is experimenting with string graphs as an alternative to de Bruijn graphs.
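As a rough illustration of the k-mer ideas mentioned here, the toy sketch below counts k-mers across a set of reads, flags rarely seen ("weak") k-mers as likely sequencing errors, and builds de Bruijn graph edges from the trusted k-mers only. The function names, `k=4` and the `min_count=2` threshold are my own assumptions for the example; real tools use much larger k, handle reverse complements, and actually correct the reads rather than just flagging positions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read, counts, k, min_count=2):
    """Start positions of k-mers in read seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


def de_bruijn_edges(counts, min_count=2):
    """Edges (prefix, suffix) of the de Bruijn graph over trusted k-mers."""
    return {(kmer[:-1], kmer[1:])
            for kmer, n in counts.items() if n >= min_count}


reads = [
    "ACGTACGGA",
    "ACGTACGGA",
    "ACGTACGGA",
    "ACGTTCGGA",  # same read with one substitution error (A -> T)
]
k = 4
counts = count_kmers(reads, k)
print(weak_positions(reads[3], counts, k))  # k-mers overlapping the error
print(weak_positions(reads[0], counts, k))  # nothing flagged in a clean read
print(len(de_bruijn_edges(counts)))
```

A single base error creates up to k weak k-mers, which is why the k-mer spectrum is such an effective handle for error correction when coverage is high.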