Methods for taxonomic assignment of shotgun whole-genome metagenomics reads

Part of an introductory course on whole-genome shotgun metagenomic sequence analysis, I thought it might be useful to reproduce here.

I am working on building a regularly updated PAUDA and LAST index of microbial-RefSeq protein sequences that might be useful for those who would like to save some time, or don't have enough memory to build these indices.

Taxonomic assignments

Reference databases used for taxonomy

Source

Database

Notes

Link

NCBI

Non-redundant proteins (nr)

Pre-built BLAST database, use fastacmd to extract sequences

ftp://ftp.ncbi.nlm.nih.gov/blast/db/

NCBI

Non-redundant nucleotides (nt)

Pre-built BLAST database

ftp://ftp.ncbi.nlm.nih.gov/blast/db/

NCBI

Microbial RefSeq genomes

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/

NCBI

Microbial RefSeq proteins

*.faa files

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/

NCBI

RefSeq genomes

Huge

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/

NCBI

RefSeq proteins

Huge

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/

HMPDACC

HMPREFG

Human body-site specific

http://www.hmpdacc.org/HMREFG/

Metaphlan

Metaphlan lineage-specific

Tiny, only considers lineage-specific sequences, pre-built for Bowtie2 or BLAST  http://huttenhower.sph.harvard.edu/metaphlan

For further databases and more information, see the useful page from the CAMERA project: http://camera.calit2.net/camdata.shtm

Comparison of software for reference-based taxonomic classification

Programme

Input

Space

Sensitivity

Speed

BLASTN

Contigs or reads

nuc

-

Slow

BLASTX

Contigs or reads

prot

++

Slow

BLAT

Contigs or reads

nuc or prot

-/+

Medium

PAUDA

Reads

ambiguous prot

+

Fast

LAST

Contigs or reads

nuc/prot

-/+

Fast

Metaphlan

Reads

nuc

?

V.fast

BWA

Reads

nuc

-

V.fast

Bowtie2

Reads

nuc

-

V.fast

Rapsearch2

Reads

nuc/prot

+

Medium

Others: BLASTZ, mpiBLAST (@yokofakun), USEARCH (@guyleonard)

Rapsearch2

Download NR <ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz>.

Rapsearch2 takes FASTA as input, so convert using:

seqtk seq -A <MYREADS>.fastq > <MYREADS>.fasta
gunzip nr.gz
prerapsearch -d nr.rap -n nr
rapsearch -q <MYREADS>.fasta -d nr.rap -z 16 -o <MYREADS>.rap -b 50 -v 50 -a t

Import to MEGAN via the command-line

echo "load taxGIFile='/mnt/phatso/nick/rapsearchdb/gi_taxid_prot.bin';" > "$fn".cmd
echo "load keggGIFile='/mnt/phatso/nick/rapsearchdb/gi2kegg.map';" >> "$fn".cmd
echo "import blastFile=$fn.rap.aln meganFile=$fn.rap.rma maxMatches=100 minScore=50.0 maxExpected=1.0 topPercent=10 minSupport=5 minComplexity=0.44 useMinimalCoverageHeuristic=false useSeed=false useCOG=false useKegg=true paired=false useIdentityFilter=false textStoragePolicy=0 blastFormat=RapSearch mapping='Taxonomy:GI_MAP=true,KEGG:GI_MAP=true';" >> "$fn".cmd
echo "quit;" >> "$fn".cmd
xvfb-run -a MEGAN5/MEGAN -g -c "$fn".cmd

How to fetch Microbial Protein Sequences

The easiest and fastest way is via the command-line, using the rsync protocol which is supported by NCBI. The advantage of rsync here is that it will only sync updates, so it is tolerant of breaks in connection, and you can run it regularly and it will only fetch files that have changed.

rsync -avz rsync://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ \
--include "*/" --include "*.faa.gz" --exclude "*" .

This command will download all files matching the pattern '*.faa.gz', i.e. microbial protein sequences.

rsync -avz rsync://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ \
 --include "*/" --include "*.genomic.gz" --exclude "*" .

Now we need to concatenate them all together:

zcat *.gz > microbial_refseq.faa

Taxonomic assignments with LAST

Creating the LAST database

Before we can search our reads, we need to build the database.

$ bin/last-291/src/lastdb microbial.lastdb <(zcat refs/microbial*)

This is an example of redirection: we are redirecting the output of the command 'zcat refs/microbial*' as the FASTA input. This saves us having to decompress and concatenate the file.

lastdb: that's some funny-lookin DNA

Woops! We need to provide -p option to tell LAST we are indexing a protein database.

$ bin/last-291/src/lastdb -p microbial.lastdb <(zcat refs/microbial*)

This process takes an hour or two on a fast server, and consumes quite a lot of memory.

Assigning reads with LAST

LAST will take FASTQ files, but only for nucleotide databases. We are going to use the six-frame translated mode, as indicated by the -F15 option. In this mode, LAST requires FASTA output!

Here's the l33t way to do it:

time lastal -F15 microbial.last \
<(seqtk seq -A datasets/ecoli/subsample/1122-H_1.fastq) \
>1122_H_1.maf

real    3m25.313s
user    3m15.316s
sys     0m10.028s

This is the same as:

seqtk seq -A datasets/ecoli/subsample/1122-H_1.fastq > 1122_H_1.fasta
lastal -F15 microbial.last 1122_H_1.fasta > 1122_H_1.maf

But you save having to create an intermediate FASTA file.

Questions: How long did it take to assign 100,000 reads with LAST?

Convert LAST results to BLAST format

LAST outputs MAF format which cannot be read by MEGAN without conversion, which we will use later. Additionally, the MAF to BLAST converter supplied with LAST does not produce exactly the right format, please use my modified version instead.  The MAF files first need to be sorted by the read identifier (-n2), and then converted, e.g.

maf-sort.sh -n2 1122-H.maf | maf-convert.py blast 1122-H.maf > 1122-H.blast.txt

Running Metaphlan

Metaphlan can use either BLAST or Bowtie2 for assignments. Bowtie2 is significantly faster, so we will use that. It should be already on your path.

metaphlan.py <fastq_file> \
--bt2_ps sensitive-local \
--input_type multifastq \
--bowtie2db software/metaphlan/bowtie2db/mpa \
-t rel_ab \
-o <relative_abundances_file>

metaphlan.py 2535b-H-STEC_1.fastq \
--bt2_ps sensitive-local --input_type multifastq \
--bowtie2db software/metaphlan/bowtie2db/mpa \
-t rel_ab \
-o 2535b.relativeabundances.txt