Workplan for STEC paper results generation
02 Nov 2012
Have 46 samples from two HiSeq 2500 flowcells to analyse, with the aim of producing a bunch of publication-ready figures and tables over the next few days.
Try to make each figure / table “publication-ready” as we go along. Prioritise the easy stuff first, in the order of processing.
Incorporate runs.txt, samples.txt and results into a SQLite database and access that via R for speed.
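A minimal sketch of the import step, assuming runs.txt and samples.txt are tab-separated with a header row (the real column layout is not recorded here). The helper name load_tsv and the demo rows are made up for illustration; only the SampleName and Description column names come from the query used later in this entry.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, handle):
    """Load a tab-separated file with a header row into a new table.

    Hypothetical helper: every column is created untyped, named after the header.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cols = ", ".join('"%s"' % name.replace('"', '""') for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    # The remaining rows of the reader are inserted one by one
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), reader)
    conn.commit()


# Tiny in-memory demonstration with made-up rows (not real run data)
demo = io.StringIO(
    "SampleName\tDescription\n"
    "S1\tHiSeq 2500 Run\n"
    "S2\tMiSeq Run\n"
)
conn = sqlite3.connect(":memory:")
load_tsv(conn, "runs", demo)
hiseq = conn.execute(
    "SELECT SampleName FROM runs WHERE Description = 'HiSeq 2500 Run'"
).fetchall()
print(hiseq)
```

For the real files, something like `load_tsv(conn, "runs", open("runs.txt", newline=""))` against a connection to the on-disk database would be the equivalent call.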
Figure: Run stats by flowcell
- Show number of reads / throughput per sample by flowcell.
- Split MiSeq from HiSeq using facets.
- Label points by sample ID if possible
Figure: Percentage human reads by stool consistency
- Initially a scatterplot
- Consider a box-and-whisker plot if it looks good
Both of these depend on the hg19 filtering step.
Table: Run stats
Table: Alignment stats against 280
Figure: Plot stx2 ratio against other data
Selection of informative coverage plots (different phage copy number, other pathogens)
Figure: Taxonomic assignment by phylum
Depends on Metaphlan.
Figure: Presence of top 10 most abundant genera/species by sample
Figure: Virulence genes grid
Depends on virulence genes assignment.
Figure: E. coli pangenome analysis
Figure: Coverage plots for non-STEC genomes
Figure: wrbA vs Shiga toxin ratio
WrbA:
>lcl||EC55989_1114|wrbA|95288236 TrpR binding protein WrbA
ATGGCTAAAGTTCTGGTGCTTTATTATTCCATGTACGGACATATTGAAACGATGGCACGC
GCAGTCGCTGAGGGTGCAAGCAAAGTCGATGGCGCAGAAGTTGTCGTTAAGCGTGTACCG
GAAACCATGCCGCCGCAATTATTTGAAAAAGCAGGCGGTAAAACGCAAACTGCACCGGTT
GCAACCCCGCAAGAACTGGCCGATTACGACGCCATTATTTTTGGTACACCTACCCGCTTT
GGCAACATGTCCGGTCAAATGCGTACCTTCCTCGACCAGACGGGCGGCCTGTGGGCTTCC
GGCGCACTATACGGAAAACTGGCGAGCGTCTTTAGTTCCACCGGTACTGGCGGCGGTCAG
GAACAAACTATTACTTCAACCTGGACGACCCTTGCGCATCACGGCATGGTAATTGTCCCC
ATTGGCTACGCAGCGCAGGAATTATTTGACGTTTCACAGGTTCGCGGCGGTACGCCGTAC
GGCGCAACCACCATCGCAGGCGGTGACGGCTCACGCCAGCCAAGCCAGGAAGAACTGTCT
ATTGCTCGTTATCAAGGGGAATATGTCGCAGGTCTGGCAGTTAAACTTAACGGCTAA
WrbA breakpoint 1:
Score = 99.6 bits (50), Expect = 3e-21
Identities = 50/50 (100%)
Strand = Plus / Plus
Query: 1 atggctaaagttctggtgctttattattccatgtacggacatattgaaac 50
||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3256078 atggctaaagttctggtgctttattattccatgtacggacatattgaaac 3256127
WrbA breakpoint 2:
Query: 518 agccaagccaggaagaactgtctattgctcgttatcaaggggaatatgtcgcaggtctgg 577
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3317479 agccaagccaggaagaactgtctattgctcgttatcaaggggaatatgtcgcaggtctgg 3317538
Query: 578 cagttaaacttaacggctaa 597
||||||||||||||||||||
Sbjct: 3317539 cagttaaacttaacggctaa 3317558
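Quick arithmetic check on the coordinates above, as a rough sketch only. It assumes both hits lie on the same subject sequence and that there are no other insertions or deletions between them, neither of which is stated in the BLAST excerpt. Under those assumptions, the difference between the subject span and the gene length estimates how much extra sequence sits between the two ends of wrbA.

```python
# Coordinates copied from the BLAST output above
gene_length = 597          # last query position in breakpoint 2
subject_start = 3256078    # subject position of query base 1
subject_end = 3317558      # subject position of query base 597

# Span on the subject from the first to the last base of wrbA
subject_span = subject_end - subject_start + 1

# Extra sequence between the two ends of the gene, assuming both hits
# are on the same subject sequence with no other indels
extra = subject_span - gene_length

print(subject_span, extra)  # 61481 60884
```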
SQLite for metadata
To facilitate these analyses I am going to store everything in a SQLite database instead of TSV files.
This permits neat partitioning of tasks across blades using Ruffus:
python pipeline.py -s metagenomics.sqlite3 -v 5 \
-c "SELECT * FROM runs WHERE Description = 'HiSeq 2500 Run' order by SampleName LIMIT 30,15;"
This also makes it easier to store results. Should have done this earlier!
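The LIMIT 30,15 in the command above selects 15 rows starting at offset 30, i.e. the third block of 15 samples. A sketch of how the full set of per-blade queries could be generated, assuming 46 HiSeq samples and blocks of 15. The helper name chunk_queries and the demo table are made up for illustration, and how each query is handed to a blade is not shown.

```python
import sqlite3


def chunk_queries(total, chunk_size, base_sql):
    """Return one 'LIMIT offset,count' query per block of rows (hypothetical helper)."""
    return [
        "%s LIMIT %d,%d;" % (base_sql, offset, chunk_size)
        for offset in range(0, total, chunk_size)
    ]


base = "SELECT * FROM runs WHERE Description = 'HiSeq 2500 Run' ORDER BY SampleName"
queries = chunk_queries(46, 15, base)

# In-memory stand-in for the runs table, filled with made-up sample names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (SampleName TEXT, Description TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("S%02d" % i, "HiSeq 2500 Run") for i in range(1, 47)],
)

# Run every block query and record how many rows each one returns
chunks = [conn.execute(q).fetchall() for q in queries]
sizes = [len(c) for c in chunks]
print(sizes)  # [15, 15, 15, 1]
```

The ORDER BY matters here: without a fixed ordering, SQLite does not guarantee that separate LIMIT queries see the rows in the same order, so blocks could overlap or miss samples.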