As part of our sequencing QC, we add 20-strain mock community to our run. I'm currently using it to optimize our DADA2 pipeline and am running into some inconsistencies when I run DADA2 in various "flavors".
It should be noted that we are still using a non-oriented library to sequence our paired-end reads.
I've run DADA2 in these flavors (attached below are the bash files used):
- Use cut-adapt to remove primers from R1 & R2, demultiplex, then run DADA2 as denoise-paired
- Use FLASh to join reads, use a custom python script to orient all joined reads in the same direction & trim off any primers, demultiplex, then run DADA2 as denoise-single
- Use a different custom python script to orient all reads in an un-joined fashion & trim off any primers, demultiplex, then run DADA2 as denoise_paired
After the feature table is generated, I filter it down to just the ATCC mock sample, and then BLAST the resulting ASVs against the full genomes of the 20-strains in order to identify each ASV as belonging to the original mock community.
In the table below, I've identified the number of reads that have a BLAST hit to each of the strains (true positives marked in green & any non-detected strains in red) as well as any false positives (marked in yellow).
Note that all BLAST hits were either 100% ID or had just a single base miss-match. In many cases, two ASVs would hit the same strain, in this case I've separated the read counts out (e.g. flavor 1 had two ASVs that hit A. baumannii with 31 & 30 counts).
Question 1: which method is better?
Which pipeline is the most "correct" way of doing things? I take it that it should be flavor 3; however it gives me the "worst" results.
Should I be doing anything differently?
Question 2: why multiple ASVs?
In almost all of the flavor 1 & 3 results, we see that multiple ASVs are identified for a given strain. Let's examine for example A. baumannii; the representative sequences found for each flavor are:
Flavor 1, ASV 1
TACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTATTAAGTCGGATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTGAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACAGG
Flavor 1, ASV 2
CCTGTTTGCTCCCCATGCTTTCGTACCTCAGCGTCAGTATTAGGCCAGATGGCTGCCTTCGCCATCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCATCCTCTCCCATACTCTAGCTCACCAGTATCGAATGCAATTCCCAAGTTAAGCTCGGGGATTTCACATCCGACTTAATAAGCCGCCTACGCACGCTTTACGCCCAGTAAATCCGATTAACGCTCGCACCCTCTGTA
Flavor 3, ASV 1
TACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTATTAAGTCGGATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTGAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACAGG
Flavor 3, ASV 2
CGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTATTAAGTCGGATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTGAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGC
In the case of flavor 1, the representative sequences are reverse complements of one another, this isn't surprising since we didn't do any read orientation. However, in flavor 3, ASV 2 is a perfect subset of ASV 1 (the section that has been bolded). What is causing this and how can it be alleviated? I'm guessing it has something to do with the quality scores for the reads that get flipped into the correct orientation and the subsequent error models that get built around it.
flavor1.sh.txt (2.9 KB)
flavor2.sh.txt (2.4 KB)
flavor3.sh.txt (2.6 KB)