SAM/BAM: It's time for a single standard for assembly output

Just what the title says really. A quick note after kicking around this issue with Peter Cock and the Tablet team.

We need a single standard for assembly output.

I don't really care what it is.

But it should be SAM/BAM.

Why? Easy. This is already the de facto standard for mapping alignments. It's actively developed. There are more viewers for SAM/BAM format than anything else - Tablet, IGV, Savant, Samtools (tview), Artemis/BAMview, and more besides. It's crazy that we need different viewers for different assemblers.

There is a powerful set of command-line tools for manipulating these files, e.g. SAMtools, Picard and GATK, plus language bindings like Pysam.

Sorted, indexed BAM files are lean and support fast random access.

It is a well-documented, open standard.

I'm not claiming it's perfect, but it's the best we have.

The not inconsiderable fly in the ointment: hardly any assemblers output SAM/BAM at the moment. Certainly none of the "popular" ones, e.g. Velvet (AMOS .afg), SOAPdenovo (no alignments), ABySS (no alignments), Newbler (ACE) for 454, MIRA (MAF and others) for hybrids. This is a shame.

So right now you probably need to convert. But I'd still say it's worth doing.

Peter Cock has a converter for MAF output from MIRA. It's possible - after a fashion - to convert ACE to SAM using glu-genetics, but it's not pretty. Newbler gsMapper will dump SAM, and we'll try to convince 454 to make the de novo assembler do the same thing.

I think you can get SAM out of AMOS bank format (not tried it).

Having all assemblies in SAM/BAM format would mean generic operations on assemblies, independent of the assembler used, would be possible.

Sure, this is possible with FASTA files - that's what we do now - but the point is that each time you run an extra step on the assembly, the vital alignment and quality information can be kept, maintaining an audit trail of how the assembly was constructed. Regions of potential ambiguity (unreliable low-coverage regions, high-coverage regions indicating collapsed repeats, homopolymers in 454 data, etc.) are made explicit. This makes life much easier when finishing, when calling SNPs, and so on.
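To make the coverage point concrete, here is a minimal sketch in plain Python of the kind of assembler-independent check that becomes possible once the read alignments travel with the assembly. It is an illustration under stated assumptions, not a recommended pipeline: the function names, the toy records in EXAMPLE_SAM and the thresholds are all made up for this example, and on real data you would more likely reach for SAMtools or Pysam.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance along the reference (per the SAM specification).
REF_CONSUMING = set("MDN=X")
# Of those, the operations where a read base actually sits on the reference.
COVERING = set("M=X")


def coverage_from_sam(lines):
    """Return {contig: Counter(position -> depth)}, positions are 1-based."""
    depth = {}
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, contig, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped read or no alignment
            continue
        counts = depth.setdefault(contig, Counter())
        ref_pos = pos
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op in COVERING:
                for p in range(ref_pos, ref_pos + length):
                    counts[p] += 1
            if op in REF_CONSUMING:
                ref_pos += length
    return depth


def flag_positions(counts, contig_length, low, high):
    """List positions with depth below `low` and positions with depth above `high`."""
    positions = range(1, contig_length + 1)
    low_pos = [p for p in positions if counts[p] < low]
    high_pos = [p for p in positions if counts[p] > high]
    return low_pos, high_pos


# Hypothetical toy assembly: one 10 bp contig, three mapped reads, one unmapped.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:contig1\tLN:10",
    "r1\t0\tcontig1\t1\t60\t5M\t*\t0\t0\tACGTA\t*",
    "r2\t0\tcontig1\t3\t60\t2M1D3M\t*\t0\t0\tGTCGT\t*",
    "r3\t16\tcontig1\t4\t60\t4M\t*\t0\t0\tTACG\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
]

if __name__ == "__main__":
    depth = coverage_from_sam(EXAMPLE_SAM)
    low, high = flag_positions(depth["contig1"], 10, low=1, high=2)
    print("uncovered positions:", low)
    print("suspiciously deep positions:", high)
```

The same two functions would work unchanged on output from any assembler that writes SAM, which is really the whole argument.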

How cool would it be to have assemblies incorporating multiple sequencer platform data, all coloured by read group in a viewer, with paired-end and mate-pair information correctly flagged? This would be assembly nirvana! (If you don't think that's uber-cool, probably this isn't the right blog for you).
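As a rough sketch of what a viewer (or a quick sanity script) has to work with here, the snippet below tallies reads per read group from SAM text, using the @RG header lines for the platform and the FLAG field for pairing. The read group IDs, sample name and records in EXAMPLE_SAM are invented for illustration, and summarise_read_groups is a hypothetical helper, not part of any existing tool.

```python
FLAG_PAIRED = 0x1       # read is part of a pair
FLAG_PROPER_PAIR = 0x2  # both mates aligned as expected


def summarise_read_groups(lines):
    """Return {read_group: {"platform", "reads", "paired", "proper_pair"}}."""
    platforms = {}
    summary = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@RG"):
            # Header tags look like ID:rg1, PL:ILLUMINA, SM:sample
            tags = dict(f.split(":", 1) for f in fields[1:])
            platforms[tags["ID"]] = tags.get("PL", "unknown")
            continue
        if line.startswith("@"):
            continue
        flag = int(fields[1])
        # Optional fields start at column 12; RG:Z:<id> names the read group.
        group = "none"
        for opt in fields[11:]:
            if opt.startswith("RG:Z:"):
                group = opt[5:]
        entry = summary.setdefault(
            group,
            {"platform": platforms.get(group, "unknown"),
             "reads": 0, "paired": 0, "proper_pair": 0},
        )
        entry["reads"] += 1
        if flag & FLAG_PAIRED:
            entry["paired"] += 1
        if flag & FLAG_PROPER_PAIR:
            entry["proper_pair"] += 1
    return summary


# Hypothetical hybrid assembly: one 454 read group, one Illumina paired-end group.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:contig1\tLN:20",
    "@RG\tID:rg454\tPL:LS454\tSM:s1",
    "@RG\tID:rgill\tPL:ILLUMINA\tSM:s1",
    "a\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\t*\tRG:Z:rg454",
    "b\t99\tcontig1\t2\t60\t4M\t=\t8\t10\tACGT\t*\tRG:Z:rgill",
    "d\t65\tcontig1\t3\t60\t4M\t=\t3\t0\tACGT\t*\tRG:Z:rgill",
    "b\t147\tcontig1\t8\t60\t4M\t=\t2\t-10\tACGT\t*\tRG:Z:rgill",
]

if __name__ == "__main__":
    for group, stats in summarise_read_groups(EXAMPLE_SAM).items():
        print(group, stats)
```

Everything needed to colour reads by platform and to spot badly behaved pairs is already in the file, with no assembler-specific parsing.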

And ideally this resulting SAM/BAM file is what would be sent to GenBank, rather than just FASTA+QUAL (and the raw reads to SRA if so inclined).

So - authors of assembly software - please make this happen! Assemblathon people, I'd like to campaign that you insist that competitors submit their assembly in SAM format, to help drive this change forward!

I know there may be objections because SAM/BAM is not entirely optimised for assemblies, but please address this by extending the SAM specification!