DADA2 - "maxMismatch rate" as an alternative to "maxMismatch"

mverce · August 3, 2021, 4:51pm

Hey,

I'm not sure if this is the place for this and I apologise if this topic has been covered before, but I noticed a taxonomy-dependent bias when merging and I think adding a "maxMismatch rate" to DADA2 (similar to cutadapt's "max-expected-errors") could be helpful in this regard.

Recently, I have been processing V3-V4 data with DADA2. The length of this region is variable, as brought up by @KQUB and @benjjneb in another topic. For example, the modal length of V3-V4 (without primers) is around 404 bp for Clostridia and 430 bp for Desulfovibrionia. This means that the overlap region can be anywhere between ca. 20 and 50 bp long.

If one is processing such a data set with less than perfect quality toward the end of the reads, one may try to increase the maxMismatch option from 0 to some small number to increase the success of merging. However, even assuming constant or favourable quality values, the expected number of mismatches is higher for longer overlaps, so a fixed number of allowed mismatches (e.g. maxMismatch = 2) still penalises the taxa with a shorter V3-V4 (= longer overlap). This can be observed by comparing the taxonomy barplots obtained from paired reads with those obtained from forward reads only. I ended up using forward reads only, in order to avoid the bias and to ensure that good quality sequence data was used.
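
To make the length dependence concrete, here is a minimal Python sketch. It is only an illustration, not part of DADA2: it assumes a constant per-base error rate in both reads (2% here, an arbitrary value), independent errors, and that any error in either read shows up as a mismatch. The function names are hypothetical.

```python
from math import comb


def mismatch_prob(per_base_error: float) -> float:
    """Chance that one overlap position disagrees between the two reads.

    Assumption: errors in the forward and reverse read are independent, and
    the rare case of both reads making the same error is ignored.
    """
    return 1.0 - (1.0 - per_base_error) ** 2


def prob_merge_rejected(overlap_len: int, max_mismatch: int,
                        per_base_error: float) -> float:
    """Probability that a true pair has more than max_mismatch mismatches.

    Mismatches are modelled as Binomial(overlap_len, q), where q is the
    per-position mismatch probability.
    """
    q = mismatch_prob(per_base_error)
    p_accepted = sum(
        comb(overlap_len, k) * q ** k * (1.0 - q) ** (overlap_len - k)
        for k in range(max_mismatch + 1)
    )
    return 1.0 - p_accepted


if __name__ == "__main__":
    error_rate = 0.02  # assumed per-base error rate, for illustration only
    for overlap_len in (20, 50):
        expected = overlap_len * mismatch_prob(error_rate)
        rejected = prob_merge_rejected(overlap_len, 2, error_rate)
        print(f"overlap {overlap_len} bp: expected mismatches {expected:.2f}, "
              f"P(rejected at maxMismatch = 2) {rejected:.3f}")
```

Under these assumptions the expected number of mismatches grows linearly with the overlap length, so at a fixed cutoff the 50 bp overlap is rejected more often than the 20 bp one.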

One way to avoid the bias would perhaps be to just increase maxMismatch until the bias is no longer observed, but that seems like it could break merging for the short overlaps, where the same allowance is a much larger fraction of the overlap.
The other solution could be to add an option to set a maximum mismatch rate, e.g. 5% or 10%, perhaps even in combination with an absolute maximum. That way, the shorter and longer overlaps are put on the same footing. Would it be possible to add that? If not, does anyone know of (or use) any other solutions? I'd appreciate any feedback!
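
The rule proposed here could look roughly like the Python sketch below. This is a hypothetical illustration of the idea, not an existing DADA2 or QIIME 2 option: the function names, the `max_rate` and `abs_max` parameters, and the choice to round the allowance down are all assumptions.

```python
from math import floor
from typing import Optional


def allowed_mismatches(overlap_len: int, max_rate: float,
                       abs_max: Optional[int] = None) -> int:
    """Mismatch allowance that scales with the overlap length.

    max_rate is the maximum fraction of mismatching positions (0.05 = 5%).
    abs_max, if given, caps the allowance regardless of overlap length.
    The small epsilon guards against floating-point results such as 1.9999...
    """
    allowance = floor(max_rate * overlap_len + 1e-9)
    if abs_max is not None:
        allowance = min(allowance, abs_max)
    return allowance


def accept_merge(n_mismatch: int, overlap_len: int, max_rate: float,
                 abs_max: Optional[int] = None) -> bool:
    """Accept a merge if its mismatch count is within the scaled allowance."""
    return n_mismatch <= allowed_mismatches(overlap_len, max_rate, abs_max)


if __name__ == "__main__":
    for overlap_len in (20, 50):
        print(overlap_len, allowed_mismatches(overlap_len, 0.05))
```

With a 5% rate, a 20 bp overlap would be allowed 1 mismatch and a 50 bp overlap 2, while `abs_max` keeps very long overlaps from accumulating a large allowance.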

Whatever the case, I really like this tool and will for sure keep on using it.

Keegan-Evans · August 5, 2021, 5:08pm

@mverce,

This would have to be implemented directly in DADA2 first, as the functionality in QIIME 2 is provided by wrapping DADA2, which is an R package. I am sure @benjjneb would welcome this as a contribution!