Methods for taxonomic assignment of shotgun whole-genome metagenomics reads
05 Aug 2013Part of an introductory course on whole-genome shotgun metagenomic sequence analysis, I thought it might be useful to reproduce here.
I am working on building a regularly updated PAUDA and LAST index of microbial-RefSeq protein sequences that might be useful for those who would like to save some time, or don't have enough memory to build these indices.
Taxonomic assignments
Reference databases used for taxonomy
Source |
Database |
Notes |
Link |
NCBI |
Non-redundant proteins (nr) |
Pre-built BLAST database, use fastacmd to extract sequences |
ftp://ftp.ncbi.nlm.nih.gov/blast/db/ |
NCBI |
Non-redundant nucleotides (nt) |
Pre-built BLAST database |
ftp://ftp.ncbi.nlm.nih.gov/blast/db/ |
NCBI |
Microbial RefSeq genomes |
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ |
|
NCBI |
Microbial RefSeq proteins |
*.faa files |
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ |
NCBI |
RefSeq genomes |
Huge |
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/ |
NCBI |
RefSeq proteins |
Huge |
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/ |
HMPDACC |
HMPREFG |
Human body-site specific |
|
Metaphlan |
Metaphlan lineage-specific |
Tiny, only considers lineage-specific sequences, pre-built for Bowtie2 or BLAST | http://huttenhower.sph.harvard.edu/metaphlan |
For further databases and more information, see the useful page from the CAMERA project: http://camera.calit2.net/camdata.shtm
Comparison of software for reference-based taxonomic classification
Programme |
Input |
Space |
Sensitivity |
Speed |
BLASTN |
Contigs or reads |
nuc |
- |
Slow |
BLASTX |
Contigs or reads |
prot |
++ |
Slow |
BLAT |
Contigs or reads |
nuc or prot |
-/+ |
Medium |
PAUDA |
Reads |
ambiguous prot |
+ |
Fast |
LAST |
Contigs or reads |
nuc/prot |
-/+ |
Fast |
Metaphlan |
Reads |
nuc |
? |
V.fast |
BWA |
Reads |
nuc |
- |
V.fast |
Bowtie2 |
Reads |
nuc |
- |
V.fast |
Rapsearch2 |
Reads |
nuc/prot |
+ |
Medium |
Others: BLASTZ, mpiBLAST (@yokofakun), USEARCH (@guyleonard)
Rapsearch2
Download NR <ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz>.
Rapsearch2 takes FASTA as input, so convert using:
seqtk seq -A <MYREADS>.fastq > <MYREADS>.fasta gunzip nr.gz prerapsearch -d nr.rap -n nr rapsearch -q <MYREADS>.fasta -d nr.rap -z 16 -o <MYREADS>.rap -b 50 -v 50 -a t
Import to MEGAN via the command-line
echo "load taxGIFile='/mnt/phatso/nick/rapsearchdb/gi_taxid_prot.bin';" > "$fn".cmd echo "load keggGIFile='/mnt/phatso/nick/rapsearchdb/gi2kegg.map';" >> "$fn".cmd echo "import blastFile=$fn.rap.aln meganFile=$fn.rap.rma maxMatches=100 minScore=50.0 maxExpected=1.0 topPercent=10 minSupport=5 minComplexity=0.44 useMinimalCoverageHeuristic=false useSeed=false useCOG=false useKegg=true paired=false useIdentityFilter=false textStoragePolicy=0 blastFormat=RapSearch mapping='Taxonomy:GI_MAP=true,KEGG:GI_MAP=true';" >> "$fn".cmd echo "quit;" >> "$fn".cmd xvfb-run -a MEGAN5/MEGAN -g -c "$fn".cmd
How to fetch Microbial Protein Sequences
The easiest and fastest way is via the command-line, using the rsync protocol which is supported by NCBI. The advantage of rsync here is that it will only sync updates, so it is tolerant of breaks in connection, and you can run it regularly and it will only fetch files that have changed.
rsync -avz rsync://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ \
--include "*/" --include "*.faa.gz" --exclude "*" .
This command will download all files matching the pattern '*.faa.gz', i.e. microbial protein sequences.
rsync -avz rsync://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ \ --include "*/" --include "*.genomic.gz" --exclude "*" .
Now we need to concatenate them all together:
zcat *.gz > microbial_refseq.faa
Taxonomic assignments with LAST
Creating the LAST database
Before we can search our reads, we need to build the database.
$ bin/last-291/src/lastdb microbial.lastdb <(zcat refs/microbial*)
This is an example of redirection: we are redirecting the output of the command 'zcat refs/microbial*' as the FASTA input. This saves us having to decompress and concatenate the file.
lastdb: that's some funny-lookin DNA
Woops! We need to provide -p option to tell LAST we are indexing a protein database.
$ bin/last-291/src/lastdb -p microbial.lastdb <(zcat refs/microbial*)
This process takes an hour or two on a fast server, and consumes quite a lot of memory.
Assigning reads with LAST
LAST will take FASTQ files, but only for nucleotide databases. We are going to use the six-frame translated mode, as indicated by the -F15 option. In this mode, LAST requires FASTA output!
Here's the l33t way to do it:
time lastal -F15 microbial.last \
<(seqtk seq -A datasets/ecoli/subsample/1122-H_1.fastq) \
>1122_H_1.maf
real 3m25.313s
user 3m15.316s
sys 0m10.028s
This is the same as:
seqtk seq -A datasets/ecoli/subsample/1122-H_1.fastq > 1122_H_1.fasta
lastal -F15 microbial.last 1122_H_1.fasta > 1122_H_1.maf
But you save having to create an intermediate FASTA file.
Questions: How long did it take to assign 100,000 reads with LAST?
Convert LAST results to BLAST format
LAST outputs MAF format which cannot be read by MEGAN without conversion, which we will use later. Additionally, the MAF to BLAST converter supplied with LAST does not produce exactly the right format, please use my modified version instead. The MAF files first need to be sorted by the read identifier (-n2), and then converted, e.g.
maf-sort.sh -n2 1122-H.maf | maf-convert.py blast 1122-H.maf > 1122-H.blast.txt
Running Metaphlan
Metaphlan can use either BLAST or Bowtie2 for assignments. Bowtie2 is significantly faster, so we will use that. It should be already on your path.
metaphlan.py <fastq_file> \
--bt2_ps sensitive-local \
--input_type multifastq \
--bowtie2db software/metaphlan/bowtie2db/mpa \
-t rel_ab \
-o <relative_abundances_file>
metaphlan.py 2535b-H-STEC_1.fastq \
--bt2_ps sensitive-local --input_type multifastq \
--bowtie2db software/metaphlan/bowtie2db/mpa \
-t rel_ab \
-o 2535b.relativeabundances.txt