collapseNoMismatch

M_R · February 14, 2022, 3:30pm

Has this DADA2 feature already been integrated in QIIME2?

In my opinion, this feature should be added as a "default" to running DADA2. I don't really see a reason why one would want to have identical ASVs of (slightly) different sizes clustering into separate ASVs (maybe someone could help me with seeing why this could be useful?), at least not in my case with 16S data.
As it is set up at the moment, it is critical (too critical, imo) to have sequences of identical lengths. I know that using something like cutadapt to find the primer sequences and trim from that point on is probably the "default" best way to tackle this at the moment, but wouldn't it be more computationally efficient to just group reads (with the sequencing primers simply trimmed off) that are 100% identical, regardless of length, into ASVs? I'm not sure how much more efficient it would be, but it seems to me that "trim and group identical sequences (no matter their lengths)" would be more efficient than "look for a pattern (primers), trim, and group identical sequences (different lengths, different groups)"?

Now that I'm thinking more about it, one possible "problem" with this could be comparing different runs, since they might have different representative sequences (of different lengths) for the same sequences. One would then first have to cluster the representative sequences from all runs (ignoring their lengths) to generate new representative sequences for the "merged run".

It may well be that I'm missing some stuff here and am over-simplifying it all. So, it would be great to hear some opinions.

Nicholas_Bokulich · March 2, 2022, 7:29am

Hi @M_R ,
Thanks for the suggestion. We have discussed adding this option to q2-dada2 for a few years now, and have an open issue at the moment. It has stalled for two reasons:

1. The collapseNoMismatch option is actually a separate step in the dada2 R workflow, which basically corresponds to OTU clustering at 100%. In theory, this could be done using the q2-vsearch plugin with de novo clustering at 100% to re-cluster ASVs that are trimmed at variable lengths.
2. In general, there is quite a bit of disagreement about whether this option is even desirable in a "typical" dada2 workflow (though this is obviously where opinions diverge based on use case and biological question). For single-end reads, truncating to the same position is generally recommended, as opposed to truncating to different positions (see some discussion here). This probably is not relevant for paired-end data unless some sort of variable spacer is used so that the start positions differ (this is rare). So this is not to say that I disagree with you that having such an option would be convenient, only that it probably should not be the default.
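
To make the first point a bit more concrete, here is a rough Python sketch of the idea behind collapsing sequences that are identical apart from their length. It is not the dada2 implementation: the function name, the input format (a dict of sequence to count), and the "one sequence is contained in the other" test are assumptions made purely for illustration, and the real function works on alignments with a minimum overlap.

```python
def collapse_identical_up_to_length(counts):
    """Merge sequences that differ only in length into one representative.

    counts: dict mapping sequence (str) -> abundance (int).
    Simplifying assumption: two sequences are "identical up to length"
    when one is an exact substring of the other.
    """
    # Visit the most abundant sequences first so they become representatives
    # (ties are broken alphabetically to keep the result deterministic).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    collapsed = {}
    for seq, n in ordered:
        for rep in list(collapsed):
            if seq in rep or rep in seq:
                # No mismatches over the shared region: fold into rep.
                collapsed[rep] += n
                break
        else:
            # No existing representative matches: keep as its own feature.
            collapsed[seq] = n
    return collapsed


# Hypothetical toy data: two length variants of one sequence plus a true variant.
example_counts = {
    "ACGTACGTAA": 10,  # full-length sequence
    "ACGTACGT": 4,     # same sequence, truncated at the 3' end
    "CGTACGTAA": 3,    # same sequence, missing the first base
    "ACGTTCGTAA": 5,   # one real mismatch, must stay separate
}
collapsed_counts = collapse_identical_up_to_length(example_counts)
```

In this toy example the two shorter variants are folded into the most abundant full-length sequence, while the sequence with a real mismatch stays as its own feature.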

But if you are interested in contributing this option to q2-dada2, you can see the open issue and discussion here. PRs are welcome.

M_R · July 29, 2022, 2:41pm

Thanks for the reply!

I did indeed "solve" my issue by using vsearch. And I see where you're coming from in that it might not be desirable to have it as the default, but I think this would definitely be a welcome addition. In our case the problem was indeed the use of heterogeneity spacers. On further inspection, it turned out that we should've been using cutadapt to trim primers instead of positional trimming (which leads to different sizes because of the spacers). So, unfortunately, we were one of those rare cases.

But I guess we learned and came out with the better/correct method to get rid of these sequencing primers (definitely when there are spacers involved).
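
For anyone running into the same thing, here is a small Python sketch of why positional trimming breaks with heterogeneity spacers while primer-anchored trimming does not. The primer, spacers, and amplicon below are made-up toy sequences, and the exact string search is only a stand-in for what cutadapt does (cutadapt also tolerates mismatches in the primer).

```python
PRIMER = "GTGCCAGC"            # hypothetical primer, for illustration only
AMPLICON = "TACGGAGGGTGCAAGC"  # hypothetical biological sequence
SPACERS = ("", "A", "CT", "GAT")  # heterogeneity spacers of length 0 to 3


def trim_positional(read, n_bases):
    """Remove a fixed number of bases from the start of the read."""
    return read[n_bases:]


def trim_at_primer(read, primer):
    """Remove everything up to and including the first exact primer match.

    Returns None when the primer is not found.
    """
    start = read.find(primer)
    if start == -1:
        return None
    return read[start + len(primer):]


# The same amplicon sequenced behind spacers of different lengths.
reads = [spacer + PRIMER + AMPLICON for spacer in SPACERS]

# Positional trimming leaves leftover primer bases that depend on the spacer,
# so one biological sequence ends up as several distinct sequences.
positional = {trim_positional(read, len(PRIMER)) for read in reads}

# Anchoring on the primer recovers the same sequence from every read.
anchored = {trim_at_primer(read, PRIMER) for read in reads}
```

Here `positional` contains one distinct sequence per spacer length, whereas `anchored` contains only the amplicon itself, which is the behaviour we were after.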