Dada2 chimera filtering and beyond

devonorourke · March 7, 2019, 6:51pm

Hoping to clarify that DADA2's chimera filtering strategy is a de novo process; that is, no set of reference sequences is used to check for chimeras? This older thread from @benjjneb seems to suggest as much.

It looks like @wasade confirmed in this GitHub post that Deblur uses vsearch's implementation of uchime_denovo.

Fair to say that both DADA2 and Deblur use de novo approaches for chimera filtering?

I'm curious whether users have explored any potential benefit of additional chimera filtering with the vsearch implementation of uchime_ref? Related: what's the downside to using a reference library over de novo identification?

Thanks!

colinbrislawn · March 7, 2019, 7:17pm

Yep!

I usually do uchime de novo first, then uchime ref second. This removes the maximum possible number of chimeras.

The main issue with all forms of chimera checking is "What if my database doesn't have the parents of my chimeras?" So de novo methods work well because they use themselves as their own reference, and ref methods work well as they use a large database of known real microbes as a reference.

I'm curious to see if any of the qiime devs use both de novo and ref based methods...
@thermokarst @Nicholas_Bokulich @benjjneb

Colin

benjjneb · March 7, 2019, 7:35pm

Yes this is correct. It may be worth being aware that the plugin gives you the option of two types of chimera filtering: "consensus" does de novo identification in each sample, takes a vote across samples, and removes all ASVs identified as chimeras in a high enough fraction of the samples in which they were present. "pooled" just lumps all ASVs in the data into one big sample, and identifies and removes chimeras that way.
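To make the "consensus" vote concrete, here is a minimal Python sketch of the idea as described above. It is not the plugin's or DADA2's actual code: the function name `consensus_chimeras`, the `flags` input layout, and the `min_fraction` parameter are all hypothetical, and the per-sample chimera calls are assumed to have been made already by some de novo detector.

```python
from typing import Dict, Set


def consensus_chimeras(flags: Dict[str, Dict[str, bool]], min_fraction: float) -> Set[str]:
    """Return the ASVs to remove under a consensus vote.

    flags[asv][sample] is True if the ASV was flagged as chimeric in that
    sample. Only samples in which the ASV is present appear as keys
    (hypothetical input layout, chosen for illustration).
    """
    removed = set()
    for asv, per_sample in flags.items():
        n_present = len(per_sample)            # samples containing this ASV
        n_flagged = sum(per_sample.values())   # samples where it was flagged
        if n_present > 0 and n_flagged / n_present >= min_fraction:
            removed.add(asv)
    return removed


# Toy per-sample calls (made-up values).
flags = {
    "ASV_a": {"S1": True, "S2": True, "S3": True},                 # flagged 3/3
    "ASV_b": {"S1": True, "S2": False, "S3": False, "S4": False},  # flagged 1/4
    "ASV_c": {"S2": False},                                        # flagged 0/1
}
to_remove = consensus_chimeras(flags, min_fraction=0.9)
print(sorted(to_remove))
```

In this toy input only ASV_a is flagged in a high enough fraction of the samples where it occurs, so it is the only one removed. The "pooled" option would instead make a single chimera call per ASV on the combined data, with no vote.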

In our testing "consensus" performs better for typical datasets/workflows.

Also in my testing, additional reference-based chimera removal on top of the de novo removal isn't a net positive when considering both sensitivity and specificity (i.e. there are some false positive chimera IDs at that step). However, my testing there has not been exhaustive.

devonorourke · March 12, 2019, 2:12pm

Related question about DADA2 filtering parameters - sorry to be a bother again @benjjneb!

Is there a parameter in DADA2 that removes singleton ASVs (ASVs found in only one sample) in the QIIME implementation (or in a standalone DADA2 program)? For example, say the resulting feature table looked like this:

Feature   Sample1   Sample2   Sample3
ASV1      5000      1000      5000
ASV2      0         20000     0
ASV3      100       300       500

Would ASV2 be dropped from the dataset entirely, or retained?

From what I can tell DADA2 retains all ASVs regardless of how many samples they are present in. Just wanted to double check that is the case, and that there isn't a parameter that allows for this function to be turned on/off like in qiime filter-features.

Thanks again!

benjjneb · March 12, 2019, 2:43pm

Yes that is the case. Filtering based on prevalence (number of samples in which an ASV is present) can be done afterwards using filter-features.
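As a rough illustration of prevalence filtering, the Python sketch below applies it to the example table from the question. The helper names `prevalence` and `filter_by_prevalence` are hypothetical and this is not how QIIME 2 implements it; in QIIME 2 itself the equivalent step would be `qiime feature-table filter-features`, which I believe exposes a `--p-min-samples` option for this.

```python
from typing import Dict

Table = Dict[str, Dict[str, int]]

# The example feature table from the question above.
table: Table = {
    "ASV1": {"Sample1": 5000, "Sample2": 1000, "Sample3": 5000},
    "ASV2": {"Sample1": 0, "Sample2": 20000, "Sample3": 0},
    "ASV3": {"Sample1": 100, "Sample2": 300, "Sample3": 500},
}


def prevalence(counts: Dict[str, int]) -> int:
    """Number of samples in which the ASV has a non-zero count."""
    return sum(1 for count in counts.values() if count > 0)


def filter_by_prevalence(table: Table, min_samples: int) -> Table:
    """Keep only ASVs present in at least `min_samples` samples."""
    return {
        asv: counts
        for asv, counts in table.items()
        if prevalence(counts) >= min_samples
    }


kept = filter_by_prevalence(table, min_samples=2)
print(sorted(kept))
```

With `min_samples=2`, ASV2 is dropped because it occurs in a single sample, however abundant it is there. DADA2 on its own would retain all three.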

devonorourke · March 12, 2019, 2:45pm

Perfect. Thanks for the clarification!

devonorourke · March 19, 2019, 1:38pm

Quick follow-up for @benjjneb. Hopefully I'm interpreting the DADA2 manual section describing a consensus strategy properly: within QIIME2, by default, a consensus chimera process is invoked, which calls the isBimeraDenovoTable function within DADA2.

I'm specifically curious about one line in that function: minSampleFraction = 0.9. Am I correct that by default, this argument is requiring that 90% of all samples in my dataset have that chimera for it to be removed?

So, for example, if I had 200 total samples being processed in a dataset, is a suspected chimera only removed if it is present in 180 samples?

If that's the case, I'm wondering what led to requiring such a high threshold in your testing/development of the parameter. That seems to me to indicate that chimera formation is quite favorable in PCR? From what R. Edgar's posted about chimeras, it seems more like 1-5%. Not saying he's correct, just noticing a huge disparity in the parameter from what he's stating.

Thanks as always for your help and insights!

benjjneb · March 19, 2019, 2:03pm

Not quite, the criterion is based on the number of samples in which that ASV is present.

So if an ASV is present in 40/200 of the samples, then it must be flagged as chimeric in ~90% of those samples or more (i.e. 36+/40) to be identified as a chimera and removed.
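The arithmetic behind the two readings can be checked with a short Python sketch. It treats the rule as a plain "at least 90% of the samples containing the ASV" cutoff; the helper `min_flagged_samples` is hypothetical, and the real isBimeraDenovoTable function may apply further adjustments that are ignored here.

```python
import math
from fractions import Fraction

# Default discussed in this thread, held as an exact fraction to avoid
# floating-point rounding in the threshold.
MIN_SAMPLE_FRACTION = Fraction(9, 10)


def min_flagged_samples(n_samples: int) -> int:
    """Smallest number of flagged samples that reaches the 90% cutoff."""
    return math.ceil(MIN_SAMPLE_FRACTION * n_samples)


total_samples = 200   # samples in the whole dataset
present_in = 40       # samples in which this ASV actually occurs

# Reading described in the reply: the denominator is the number of samples
# in which the ASV is present.
needed = min_flagged_samples(present_in)

# The earlier misreading: the denominator is every sample in the dataset.
misreading = min_flagged_samples(total_samples)

print(f"flagged in at least {needed} of {present_in} samples")
print(f"misreading would require {misreading} of {total_samples} samples")
```

Under the first reading an ASV seen in 40 samples needs chimera calls in 36 of them, while the misreading would have required 180 samples, more than the ASV even occurs in.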

That threshold was set based on some sensitivity/specificity testing done on the data reported on in our 2017 PNAS paper on the vaginal microbiome and preterm birth, but we've not formalized the analysis into its own publication.

devonorourke · March 19, 2019, 2:06pm

Ah - that makes far more sense.
Thanks for clearing it up!