Methods for taxonomic assignment of shotgun whole-genome metagenomics reads

05 Aug 2013

Part of an introductory course on whole-genome shotgun metagenomic sequence analysis, I thought it might be useful to reproduce here.

I am working on building a regularly updated PAUDA and LAST index of microbial-RefSeq protein sequences that might be useful for those who would like to save some time, or don't have enough memory to build these indices.

Taxonomic assignments

Reference databases used for taxonomy

Source	Database	Notes	Link
NCBI	Non-redundant proteins (nr)	Pre-built BLAST database, use fastacmd to extract sequences	ftp://ftp.ncbi.nlm.nih.gov/blast/db/
NCBI	Non-redundant nucleotides (nt)	Pre-built BLAST database	ftp://ftp.ncbi.nlm.nih.gov/blast/db/
NCBI	Microbial RefSeq genomes		ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/
NCBI	Microbial RefSeq proteins	*.faa files	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/
NCBI	RefSeq genomes	Huge	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
NCBI	RefSeq proteins	Huge	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
HMPDACC	HMPREFG	Human body-site specific	http://www.hmpdacc.org/HMREFG/
Metaphlan	Metaphlan lineage-specific	Tiny, only considers lineage-specific sequences, pre-built for Bowtie2 or BLAST	http://huttenhower.sph.harvard.edu/metaphlan

For further databases and more information, see the useful page from the CAMERA project: http://camera.calit2.net/camdata.shtm

Comparison of software for reference-based taxonomic classification

Programme	Input	Space	Sensitivity	Speed
BLASTN	Contigs or reads	nuc	-	Slow
BLASTX	Contigs or reads	prot	++	Slow
BLAT	Contigs or reads	nuc or prot	-/+	Medium
PAUDA	Reads	ambiguous prot	+	Fast
LAST	Contigs or reads	nuc/prot	-/+	Fast
Metaphlan	Reads	nuc	?	V.fast
BWA	Reads	nuc	-	V.fast
Bowtie2	Reads	nuc	-	V.fast
Rapsearch2	Reads	nuc/prot	+	Medium

Others: BLASTZ, mpiBLAST (@yokofakun), USEARCH (@guyleonard)

Rapsearch2

Download NR <ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz>.

Rapsearch2 takes FASTA as input, so convert using:

seqtk seq -A <MYREADS>.fastq > <MYREADS>.fasta
gunzip nr.gz
prerapsearch -d nr.rap -n nr
rapsearch -q <MYREADS>.fasta -d nr.rap -z 16 -o <MYREADS>.rap -b 50 -v 50 -a t

Import to MEGAN via the command-line

echo "load taxGIFile='/mnt/phatso/nick/rapsearchdb/gi_taxid_prot.bin';" > "$fn".cmd
echo "load keggGIFile='/mnt/phatso/nick/rapsearchdb/gi2kegg.map';" >> "$fn".cmd
echo "import blastFile=$fn.rap.aln meganFile=$fn.rap.rma maxMatches=100 minScore=50.0 maxExpected=1.0 topPercent=10 minSupport=5 minComplexity=0.44 useMinimalCoverageHeuristic=false useSeed=false useCOG=false useKegg=true paired=false useIdentityFilter=false textStoragePolicy=0 blastFormat=RapSearch mapping='Taxonomy:GI_MAP=true,KEGG:GI_MAP=true';" >> "$fn".cmd
echo "quit;" >> "$fn".cmd
xvfb-run -a MEGAN5/MEGAN -g -c "$fn".cmd

How to fetch Microbial Protein Sequences

The easiest and fastest way is via the command-line, using the rsync protocol which is supported by NCBI. The advantage of rsync here is that it will only sync updates, so it is tolerant of breaks in connection, and you can run it regularly and it will only fetch files that have changed.

rsync -avz rsync://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ \
--include "*/" --include "*.faa.gz" --exclude "*" .

This command will download all files matching the pattern '*.faa.gz', i.e. microbial protein sequences.

rsync -avz rsync://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ \
 --include "*/" --include "*.genomic.gz" --exclude "*" .

Now we need to concatenate them all together:

zcat *.gz > microbial_refseq.faa

Taxonomic assignments with LAST

Creating the LAST database

Before we can search our reads, we need to build the database.

$ bin/last-291/src/lastdb microbial.lastdb <(zcat refs/microbial*)

This is an example of redirection: we are redirecting the output of the command 'zcat refs/microbial*' as the FASTA input. This saves us having to decompress and concatenate the file.

lastdb: that's some funny-lookin DNA

Woops! We need to provide -p option to tell LAST we are indexing a protein database.

$ bin/last-291/src/lastdb -p microbial.lastdb <(zcat refs/microbial*)

This process takes an hour or two on a fast server, and consumes quite a lot of memory.

Assigning reads with LAST

LAST will take FASTQ files, but only for nucleotide databases. We are going to use the six-frame translated mode, as indicated by the -F15 option. In this mode, LAST requires FASTA output!

Here's the l33t way to do it:

time lastal -F15 microbial.last \
<(seqtk seq -A datasets/ecoli/subsample/1122-H_1.fastq) \
>1122_H_1.maf

real    3m25.313s
user    3m15.316s
sys     0m10.028s

This is the same as:

seqtk seq -A datasets/ecoli/subsample/1122-H_1.fastq > 1122_H_1.fasta
lastal -F15 microbial.last 1122_H_1.fasta > 1122_H_1.maf

But you save having to create an intermediate FASTA file.

Questions: How long did it take to assign 100,000 reads with LAST?

Convert LAST results to BLAST format

LAST outputs MAF format which cannot be read by MEGAN without conversion, which we will use later. Additionally, the MAF to BLAST converter supplied with LAST does not produce exactly the right format, please use my modified version instead. The MAF files first need to be sorted by the read identifier (-n2), and then converted, e.g.

maf-sort.sh -n2 1122-H.maf | maf-convert.py blast 1122-H.maf > 1122-H.blast.txt

Running Metaphlan

Metaphlan can use either BLAST or Bowtie2 for assignments. Bowtie2 is significantly faster, so we will use that. It should be already on your path.

metaphlan.py <fastq_file> \
--bt2_ps sensitive-local \
--input_type multifastq \
--bowtie2db software/metaphlan/bowtie2db/mpa \
-t rel_ab \
-o <relative_abundances_file>

metaphlan.py 2535b-H-STEC_1.fastq \
--bt2_ps sensitive-local --input_type multifastq \
--bowtie2db software/metaphlan/bowtie2db/mpa \
-t rel_ab \
-o 2535b.relativeabundances.txt

Loman Labs