Assembling Illumina and 454 data

This is a question that keeps cropping up on Seqanswers and Biostar.

Amazingly, there is still no fully satisfactory pipeline for de novo assembly of combined Illumina and 454 data.

Here are the ways I know about:

1) Assemble 454 data on its own and correct with Illumina data

For example, assemble the 454 data with Newbler, then correct the resulting contigs by mapping the Illumina reads against them with a pipeline like Nesoni.
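To make the "correct with Illumina data" step concrete, here is a toy sketch of the underlying idea: pile up mapped reads over the draft contig and overwrite a base only where the reads strongly disagree with it. This is not how Nesoni or any other real pipeline is implemented. The function name `correct_consensus`, the `(start, sequence)` read format and the `min_depth` / `min_fraction` thresholds are all assumptions made for illustration, and the sketch handles ungapped substitutions only, whereas 454 errors are often homopolymer-length indels that need a gapped aligner.

```python
from collections import Counter


def correct_consensus(draft, aligned_reads, min_depth=3, min_fraction=0.8):
    """Toy substitution-only correction of a draft contig.

    aligned_reads is a list of (start, sequence) pairs: ungapped reads
    with a 0-based start position on the draft (a hypothetical format).
    """
    # One base counter per draft position.
    columns = [Counter() for _ in draft]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(draft):
                columns[pos][base] += 1

    corrected = []
    for base, counts in zip(draft, columns):
        depth = sum(counts.values())
        if depth >= min_depth:
            top_base, top_count = counts.most_common(1)[0]
            # Only overwrite when coverage is adequate and the reads
            # overwhelmingly support a different base.
            if top_base != base and top_count / depth >= min_fraction:
                base = top_base
        corrected.append(base)
    return "".join(corrected)


if __name__ == "__main__":
    draft = "ACGTTGCA"
    reads = [(0, "ACGA"), (1, "CGAT"), (2, "GATG"), (3, "ATGC")]
    print(correct_consensus(draft, reads))
```

Positions with too little coverage are left untouched, which is the conservative choice when the draft comes from a different technology than the correcting reads.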

Advantages:

Disadvantages:

2) Perform a hybrid assembly with MIRA

Advantages:

Disadvantages:

3) Perform a hybrid assembly with CLC Genomics Workbench

Advantages:

Disadvantages:

4) Perform a hybrid assembly with ... SeqMan NGen/Ray/Celera Assembler/other

Included for completeness; I have not spent much time with these packages.

5) Assemble Illumina data and 454 data separately and combine with MINIMUS

Advantages:

Disadvantages:

6) Fake Sanger reads from 454 or Illumina assembly and feed to the other assembler

I really don't like this approach because so much useful information is lost in the resulting assembly, so I haven't tried it.

7) Local assembly of abundant paired-end data to fill 454 scaffolds

This is a useful complement to the approaches above: tools such as BGI's GapCloser or IMAGE use abundant Illumina paired-end data, together with local assembly, to try to fill gaps in scaffolds.

Update: 8) Newbler 2.6, incorporating FASTQ files

I can't believe I forgot this, thanks to Anthony Underwood for reminding me.

Newbler 2.6 now accepts FASTQ files, so this may be a good option. I am going to have a play around with it and will post back my findings.

Conclusion

I still think it's surprising there is no definitive assembly solution that can use 454 and Illumina data of all flavours and produce reliable, error-corrected scaffolds. Please correct me if I'm wrong! Similar issues may apply to combining Illumina or SOLiD data with Ion Torrent, PacBio, etc.

Comments, corrections, feedback as always appreciated.

Postscript:

One issue here is that, historically, fundamentally different approaches have been used for 454 and Illumina data: 454 assemblers use overlap-layout-consensus, while Illumina assemblers use de Bruijn graphs. However, with the advent of longer Illumina reads (100-150 bp) and greater accuracy (particularly if you use a k-mer error correction approach), overlap-layout-consensus may become an option for Illumina data too. Jared Simpson is experimenting with string graphs as an alternative to de Bruijn graphs.
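For readers who haven't met these ideas, here is a minimal sketch of k-mer counting and de Bruijn graph construction. It is a toy under stated assumptions, not the method of any particular assembler: the function names `kmer_counts` and `de_bruijn_edges` and the `min_count` cut-off are made up for illustration, and real tools also deal with reverse complements, quality scores and far larger data. Dropping low-count k-mers is a crude stand-in for k-mer error correction, on the reasoning that a k-mer seen only once in deep coverage is more likely a sequencing error than real sequence.

```python
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Count every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(counts, min_count=2):
    """Build a de Bruijn graph from k-mers seen at least min_count times.

    Nodes are (k-1)-mers; each retained k-mer adds an edge from its
    prefix to its suffix.
    """
    graph = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    # Two identical reads plus one read with a single-base error.
    reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
    counts = kmer_counts(reads, k=4)
    graph = de_bruijn_edges(counts, min_count=2)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

Overlap-layout-consensus, by contrast, compares whole reads against each other to find overlaps instead of breaking them into k-mers, which is why it has traditionally suited longer reads.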