Optimizing DADA2 pipeline with ATCC mock control

As part of our sequencing QC, we add 20-strain mock community to our run. I'm currently using it to optimize our DADA2 pipeline and am running into some inconsistencies when I run DADA2 in various "flavors".

It should be noted that we are still using a non-oriented library to sequence our paired-end reads.

I've run DADA2 in these flavors (attached below are the bash files used):

  1. Use cut-adapt to remove primers from R1 & R2, demultiplex, then run DADA2 as denoise-paired
  2. Use FLASh to join reads, use a custom python script to orient all joined reads in the same direction & trim off any primers, demultiplex, then run DADA2 as denoise-single
  3. Use a different custom python script to orient all reads in an un-joined fashion & trim off any primers, demultiplex, then run DADA2 as denoise_paired

After the feature table is generated, I filter it down to just the ATCC mock sample, and then BLAST the resulting ASVs against the full genomes of the 20-strains in order to identify each ASV as belonging to the original mock community.

In the table below, I've identified the number of reads that have a BLAST hit to each of the strains (true positives marked in green & any non-detected strains in red) as well as any false positives (marked in yellow).

Note that all BLAST hits were either 100% ID or had just a single base miss-match. In many cases, two ASVs would hit the same strain, in this case I've separated the read counts out (e.g. flavor 1 had two ASVs that hit A. baumannii with 31 & 30 counts).

Question 1: which method is better?

Which pipeline is the most "correct" way of doing things? I take it that it should be flavor 3; however it gives me the "worst" results. :frowning:
Should I be doing anything differently?

Question 2: why multiple ASVs?

In almost all of the flavor 1 & 3 results, we see that multiple ASVs are identified for a given strain. Let's examine for example A. baumannii; the representative sequences found for each flavor are:

Flavor 1, ASV 1

TACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTATTAAGTCGGATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTGAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACAGG

Flavor 1, ASV 2

CCTGTTTGCTCCCCATGCTTTCGTACCTCAGCGTCAGTATTAGGCCAGATGGCTGCCTTCGCCATCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCATCCTCTCCCATACTCTAGCTCACCAGTATCGAATGCAATTCCCAAGTTAAGCTCGGGGATTTCACATCCGACTTAATAAGCCGCCTACGCACGCTTTACGCCCAGTAAATCCGATTAACGCTCGCACCCTCTGTA

Flavor 3, ASV 1

TACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTATTAAGTCGGATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTGAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACAGG

Flavor 3, ASV 2

CGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTATTAAGTCGGATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTGAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGC

In the case of flavor 1, the representative sequences are reverse complements of one another, this isn't surprising since we didn't do any read orientation. However, in flavor 3, ASV 2 is a perfect subset of ASV 1 (the section that has been bolded). What is causing this and how can it be alleviated? I'm guessing it has something to do with the quality scores for the reads that get flipped into the correct orientation and the subsequent error models that get built around it.

flavor1.sh.txt (2.9 KB)
flavor2.sh.txt (2.4 KB)
flavor3.sh.txt (2.6 KB)

Hi @schillebeeckx!

We have a plugin to answer precisely this sort of question with mock communities: q2-quality-control. That should help.

Sounds like you are classifying only against the expected genomes. It would be interesting to compare results with eval-composition on both this classification and on classification with standard methods (e.g., with greengenes or silva 16S databases)... since I assume you are validating these "flavors" to use on datasets where you are classifying unknowns.

Also see this tutorial. Since you have expected sequences (genomes for each expected species, though you may want to trim out the 16S genes), you can determine how well the denoised seqs with each flavor match these.

It looks like flavor 3 is "worst" because you are getting more false negatives... is that what your assessment is based on? Notably, these false-negatives are largely seqs that are low abundance in the other "flavors" and/or species that are difficult to distinguish from others (e.g., the Staph species I believe are very very similar within 16S).

If you want to calibrate on multiple mock communities, I curate a compendium here (to which you are very welcome to contribute your own). This would help avoid overfitting to a single dataset.

You are definitely correct about reverse complements being dereplicated as separate ASVs... dada2 does not look for reverse complements when dereplicating and this can definitely cause issues, e.g., with alpha diversity or comparison to oriented libraries.

Multiple rRNA operon copies will also lead to this effect... for some species there are slight differences in 16S seqs between copies. But that does not explain what you are seeing here.

This is really bizarre and I assume is some artifact of the reverse complementation scripts you are using. Any clue what the flanking seqs are? e.g., fragments of primer? I am not sure that this is an effect coming from any part of QIIME2.

You could use q2-cutadapt to trim off these segments if they are always the same sequence.

That is definitely an intriguing thought, though I am very skeptical that this would cause such a dramatic effect. Flipping the quality scores would probably muddy dada2's error models — all told, the kosher thing to do would probably be to run dada2, then reverse complement and re-dereplicate. @benjjneb any thoughts on this? @schillebeeckx that could make for a very useful method for a qiime2 plugin if you are interested in going down that path :wink:

I hope that helps!

For the R package, we currently recommend Flavor 1, followed by resolving the reverse-complementation in the output sequences table. However, I’m not sure if there is a way to do this in Q2 yet, i.e. orient sequences against a reference set, and then merge duplicates that arise.

If you are performing V4 sequencing with essentially total overlap of 2x250nts, Flavor 2 is probably the best choice. We do not officially recommend pre-merging reads because in longer-amplicons, e.g. V3V4, the heterogeneity between error profiles in the overlapping and non-overlapping parts of the merged reads can lead to false positives: https://github.com/benjjneb/dada2/issues/161v

I’ll be honest, I don’t understand what is going on in Flavor 3, but it probably has to do with the trimming step on the mixed F/R reads. I wouldn’t recommend going this route.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.