Tips for de novo bacterial genome assembly

I have found Velvet to be an excellent option for short-read de novo assembly of bacterial genomes. We are routinely seeing Velvet assemblies of Illumina data producing statistics similar to Newbler assemblies of 454 data. It seems that Illumina runs with 75-base-pair reads and paired-end libraries can be competitive with Newbler assemblies from 454 fragment libraries. Amazing that just a year or two ago the dogma was that short-read technologies weren't useful for de novo assembly.

However, getting the best out of Velvet is a bit trickier than using Newbler. Roche's informatics is really slick, and running a Newbler assembly is pretty much "fire and forget". But the cost-effectiveness of Illumina means it is turning into a popular choice for bacterial genomics. I am frequently asked at presentations for details of our pipeline for handling Illumina data, so I thought I could point people at this blog post in future.

Firstly, de novo assembly with short reads gives much better results if you aggressively pre-filter the reads passed to the assembler (whether it is Velvet or something else). The difference in assembly statistics between passing the "raw" base-call output from Illumina and passing a stringently filtered set is night and day: for an M. tuberculosis genome, the assembly N50 jumped from 1 kb to 30 kb after filtering the reads. It is also not uncommon to find Illumina datasets with 100-1000-fold coverage of the genome, thanks to the massively parallel nature of the technology. Here more is not necessarily a good thing: submitting too many reads will dramatically increase memory requirements and may also produce a worse assembly, because the excess of error-containing reads complicates the de Bruijn graph. Velvet does not take quality scores into account, which perhaps explains why such filtering steps are so useful.

Filtering reads is a question of trial and error, but I have found the following techniques all useful. You can use them all, or mix and match for your application. The scripts I reference are all Python scripts which require the latest version (>=1.51b) of the invaluable BioPython package.

All the scripts assume you have a single FASTQ file. If there is paired-end information then the forward read should be followed directly by the reverse read in the file. If your pipeline produces a forward and a reverse read file then use Velvet's bundled shuffleSequences.pl script to produce a single FASTQ file before proceeding.
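
If you prefer to stay in Python, something along these lines does the same job as shuffleSequences.pl. It is only a minimal sketch: it assumes the two input files list their mates in the same order, and the file names are purely examples.

from Bio import SeqIO

def interleave(forward_path, reverse_path, out_path):
    # Stream both files in parallel and write forward/reverse mates adjacently
    with open(out_path, "w") as out:
        forward = SeqIO.parse(forward_path, "fastq")
        reverse = SeqIO.parse(reverse_path, "fastq")
        for fwd, rev in zip(forward, reverse):
            SeqIO.write([fwd, rev], out, "fastq")

interleave("reads_1.fastq", "reads_2.fastq", "shuffled.fastq")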

1) Count per-base quality scores and trim the reads

Run quality_breakdown.py on your FASTQ file. It prints to stdout a tab-separated report showing the mean quality score at each base position. You can plot this in Excel; it usually shows quality degrading roughly linearly along the read. You have to use your discretion about where to trim each read, but I tend to cut about 20 bases in from the end, or where the average quality falls to 20, whichever gives the longer read. Then run trim_reads.py over your file, specifying where in the read to trim. I like to have all reads the same length at the end.
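
I haven't reproduced quality_breakdown.py or trim_reads.py here, but the idea behind them is roughly the following Biopython sketch; the file names and the trim point of 50 bases are purely illustrative and should be chosen from your own quality report.

from Bio import SeqIO

# First pass: accumulate quality scores per base position
totals, counts = {}, {}
for rec in SeqIO.parse("shuffled.fastq", "fastq"):
    for pos, qual in enumerate(rec.letter_annotations["phred_quality"]):
        totals[pos] = totals.get(pos, 0) + qual
        counts[pos] = counts.get(pos, 0) + 1

# Tab-separated report of mean quality at each position (plot this to pick a trim point)
for pos in sorted(totals):
    print("%d\t%.2f" % (pos + 1, totals[pos] / float(counts[pos])))

# Second pass: trim every read to a fixed length so all reads end up the same size
TRIM_AT = 50  # illustrative value; choose it from the report above
trimmed = (rec[:TRIM_AT] for rec in SeqIO.parse("shuffled.fastq", "fastq"))
SeqIO.write(trimmed, "trimmed.fastq", "fastq")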

2) Eliminate singletons

If you have excess depth of coverage, and particularly if you have at least x-fold coverage where x is the read length, then eliminating singletons is a nice way of dramatically reducing the number of error-prone reads you pass to Velvet. If you don't have great coverage this step might not be possible. Simply run eliminate_singletons.py passing the FASTQ file as the first argument. It will ensure that the output still correctly pairs the reads.
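
The script itself isn't shown here, but one way to picture the idea is the sketch below: count exact read sequences in a first pass, then drop any pair in which either mate's sequence occurs only once. This illustrates the principle rather than the actual eliminate_singletons.py, and the file names are only examples.

from collections import Counter
from Bio import SeqIO

# First pass: count how often each exact read sequence occurs
# (this holds every distinct sequence in memory, which is fine for a bacterial dataset)
counts = Counter(str(rec.seq) for rec in SeqIO.parse("trimmed.fastq", "fastq"))

def non_singleton_pairs(path):
    records = SeqIO.parse(path, "fastq")
    # The file is interleaved, so consecutive records are mates
    for fwd, rev in zip(records, records):
        if counts[str(fwd.seq)] > 1 and counts[str(rev.seq)] > 1:
            yield fwd
            yield rev

SeqIO.write(non_singleton_pairs("trimmed.fastq"), "no_singletons.fastq", "fastq")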

3) Chuck away reads with Ns in

If you can afford the loss of coverage, you might throw away all reads containing Ns. Velvet internally uses a 2-bit encoding for nucleotides and so makes no provision for Ns; I believe it converts Ns to a base chosen at random (it used to use Ns). eliminate_n_paired.py will discard any pair in which either read contains an N. Be cautious: I once had a Solexa read file in which every read had an N at positions 31, 32 and 33, so I had to trim the reads at position 30 before running this script.
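
Again as an illustration of the idea rather than the script itself, a pair-aware N filter over an interleaved file might look something like this (file names are examples):

from Bio import SeqIO

def pairs_without_n(path):
    records = SeqIO.parse(path, "fastq")
    # The file is interleaved, so consecutive records are mates;
    # drop the whole pair if either read contains an N
    for fwd, rev in zip(records, records):
        if "N" not in str(fwd.seq).upper() and "N" not in str(rev.seq).upper():
            yield fwd
            yield rev

SeqIO.write(pairs_without_n("no_singletons.fastq"), "no_ns.fastq", "fastq")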

4) Throw out reads with low average quality

This step is probably optional. You can discard all reads with an average quality score below a certain threshold using filter_reads.py; I might use 20, for the sake of argument.
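
For completeness, a rough sketch of this kind of filter (not the actual filter_reads.py), keeping a pair only when both mates pass the threshold so the interleaving stays intact:

from Bio import SeqIO

MIN_MEAN_QUALITY = 20  # the threshold suggested above

def mean_quality(rec):
    quals = rec.letter_annotations["phred_quality"]
    return sum(quals) / float(len(quals))

def high_quality_pairs(path):
    records = SeqIO.parse(path, "fastq")
    for fwd, rev in zip(records, records):
        if mean_quality(fwd) >= MIN_MEAN_QUALITY and mean_quality(rev) >= MIN_MEAN_QUALITY:
            yield fwd
            yield rev

SeqIO.write(high_quality_pairs("no_ns.fastq"), "filtered.fastq", "fastq")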

Once you have a filtered read file you may want to use Simon Gladman's excellent VelvetOptimiser script (also bundled with Velvet) to perform a parameter scan looking for the assembly with the best 'contiguity', i.e. the greatest N50. Whilst it is great to have long contigs - they make life easier in downstream analysis - it is worth bearing in mind that a greater N50 is associated with a higher chance of misassemblies. If you are doing true de novo assembly you will have no reference to check the results against, so it is imperative that you do some sanity checking on your assembly, particularly if you are using the output for variation detection between closely related strains (as opposed to, say, getting a quick and dirty proteome prediction).
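
In case N50 is unfamiliar: it is the contig length at which, walking from the longest contig downwards, the running total first reaches half of the summed assembly length. A quick Biopython sketch of the calculation over Velvet's contigs.fa output:

from Bio import SeqIO

def n50(fasta_path):
    # Sort contig lengths from longest to shortest and walk down until
    # the running total reaches half the summed assembly length
    lengths = sorted((len(rec) for rec in SeqIO.parse(fasta_path, "fasta")), reverse=True)
    half_total = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

print(n50("contigs.fa"))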

I would be keen to hear what other people do to help validate their assemblies, but one tip I can impart is to use a short-read aligner (I like Bowtie, but BWA, MAQ or NovoAlign would work as well) to map the reads back to the draft assembly produced by Velvet. Then search for high-frequency variants, i.e. SNPs and indels, using either VarScan or the built-in MAQ SNP pipeline; you would expect to see few, or hopefully no, homozygous variant calls.
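
My own route goes through the tools above, but purely as an illustration of the check, here is a pysam-based sketch (not part of the pipeline described) that flags positions where nearly all mapped reads disagree with the draft consensus. It assumes a sorted, indexed BAM of reads mapped back to contigs.fa; the file names and thresholds are only examples.

from collections import Counter
import pysam

# Reads previously mapped back to the draft assembly (sorted, indexed BAM)
bam = pysam.AlignmentFile("reads_vs_draft.bam", "rb")
draft = pysam.FastaFile("contigs.fa")

for column in bam.pileup():
    bases = Counter()
    for read in column.pileups:
        if read.is_del or read.is_refskip or read.query_position is None:
            continue
        bases[read.alignment.query_sequence[read.query_position].upper()] += 1
    depth = sum(bases.values())
    if depth < 10:  # skip thinly covered positions
        continue
    ref_base = draft.fetch(column.reference_name, column.reference_pos,
                           column.reference_pos + 1).upper()
    top_base, top_count = bases.most_common(1)[0]
    # A near-unanimous disagreement with the draft looks like a homozygous variant call
    if top_base != ref_base and top_count >= 0.9 * depth:
        print("%s\t%d\t%s>%s\t%d/%d" % (column.reference_name, column.reference_pos + 1,
                                        ref_base, top_base, top_count, depth))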

If you do see variant calls, check the read depth of that contig. If it is 2 s.d. above the mean coverage for the genome, you are likely looking at a repeat consensus contig, i.e. a contig produced by collapsing two genomic regions into one. SNP calls in these contigs are unreliable by their nature. If you are seeing lots of variant calls you may be suffering from misassemblies; consider changing the Velvet settings to make it more sensitive. SNPs occurring near contig boundaries are likely to be due to spurious alignments and can probably be ignored. SNP calls supported by all reads in the middle of long contigs should raise alarm bells.
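
As a rough sketch of that read-depth check, the following, fed with the three-column output of samtools depth (contig, position, depth) under an assumed file name, flags contigs whose mean coverage sits more than two standard deviations above the genome-wide mean:

import math
from collections import defaultdict

# Expects the three-column output of samtools depth; depth.txt is just an example name
per_contig = defaultdict(list)
with open("depth.txt") as handle:
    for line in handle:
        contig, _pos, depth = line.rstrip("\n").split("\t")
        per_contig[contig].append(int(depth))

all_depths = [d for depths in per_contig.values() for d in depths]
mean = sum(all_depths) / float(len(all_depths))
sd = math.sqrt(sum((d - mean) ** 2 for d in all_depths) / len(all_depths))

for contig, depths in sorted(per_contig.items()):
    contig_mean = sum(depths) / float(len(depths))
    if contig_mean > mean + 2 * sd:
        print("%s\tmean depth %.1f\tpossible collapsed repeat" % (contig, contig_mean))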

When trying this approach with some paired-end data I came across a curious phenomenon. I was getting about 60 SNP calls when mapping my reads back to my assembled genome. I used the excellent Hawkeye, part of the AMOS package, to see what was happening in these regions and whether there were obvious signs of misassembly, i.e. low-quality alignments. What I found was unexpected: the SNP call differed from Velvet's consensus, but the pile-up alignment at that base agreed with the SNP call. It looked like the consensus was being called incorrectly. I was confused by the result, but the ever-helpful Daniel Zerbino nailed the problem immediately: the consensus was being called from the original contig sequence, where there was a mixture of alleles at the locus, before the paired-end resolution code kicked in. Velvet correctly untangles the repeat using the paired-end information, but does not re-call the consensus.

An easy way of fixing this is to use the AMOS recallConsensus software, which goes base-by-base through the alignment and calls the likeliest base at each position. It's pretty slow, but it works OK. To produce AMOS files you need to turn on Velvet's read-tracking mode and then convert the AFG file into an AMOS bank. After you have run recallConsensus you can convert the bank back to a FASTA file and use that for your further analysis.

I'd be interested to hear from readers any of their experiences with using Velvet for de novo bacterial genome sequencing.