Thanks for sharing your NMDS plots, @Estelle!
Okay, that makes a lot of sense then: a non-phylogenetic metric will be even more sensitive to this effect than UniFrac (I predict), because it treats every feature as equally distinct from every other feature, no matter how similar the sequences are. So SVs that differ by 1 nucleotide (e.g., strain-level differences) will have the same impact on sample distance as SVs that are 100% different.
That's effectively what you are seeing here: Each treatment group has a distinctive feature profile but different "strains" (e.g., SVs with very few differences) are present in the different samples/technical reps, causing technical reps to cluster together.
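To make the "equidistant features" point concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration: the SV names, the counts, and the tiny tree (`sv1` and `sv2` are near-identical sister tips, `sv3` sits on a long separate branch). The UniFrac function is a hand-rolled version of the unweighted definition (unique branch length divided by total observed branch length), not the implementation QIIME 2 uses.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as {feature: count}."""
    features = set(a) | set(b)
    num = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    den = sum(a.get(f, 0) + b.get(f, 0) for f in features)
    return num / den


def unweighted_unifrac(a, b, branches):
    """Unique branch length / total observed branch length.

    `branches` is a list of (length, set_of_descendant_tips) pairs,
    a simplified stand-in for a real phylogenetic tree.
    """
    tips_a = {f for f, c in a.items() if c > 0}
    tips_b = {f for f, c in b.items() if c > 0}
    unique = total = 0.0
    for length, tips in branches:
        in_a = bool(tips & tips_a)
        in_b = bool(tips & tips_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:  # branch leads to only one of the two samples
                unique += length
    return unique / total


# Hypothetical tree: sv1 and sv2 share a parent branch; sv3 is far away.
BRANCHES = [
    (0.01, {"sv1"}),
    (0.01, {"sv2"}),
    (0.50, {"sv1", "sv2"}),  # shared ancestor of the two "strains"
    (1.00, {"sv3"}),
]

sample_a = {"sv1": 10}  # one "strain"
sample_b = {"sv2": 10}  # a near-identical "strain"
sample_c = {"sv3": 10}  # an unrelated taxon

print("Bray-Curtis A vs B:", bray_curtis(sample_a, sample_b))
print("Bray-Curtis A vs C:", bray_curtis(sample_a, sample_c))
print("UniFrac     A vs B:", unweighted_unifrac(sample_a, sample_b, BRANCHES))
print("UniFrac     A vs C:", unweighted_unifrac(sample_a, sample_c, BRANCHES))
```

Bray-Curtis puts A as far from B as it is from C (both maximal), while the toy UniFrac puts A and B very close together and keeps A and C maximally apart.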
I suspect that "treatment" is still separated by the x- or y-axis, so the dada2 results probably do replicate the same treatment effect; it's just that your SVs are so sensitive that they also distinguish samples/reps! This result seems like a pretty good advertisement for dada2 to me (look at that technical precision!) but I can also understand if this causes some apprehension.
For many years we've relied on OTU picking for microbiome studies, so the noisy clusters we see on ordination plots have become ingrained as "normal". Data like yours may reveal the true potential of dada2 and other denoising methods for removing that noise, so it may look "abnormal" now, but it is actually improving resolution. @Micro_Biologist makes a good point in this regard (thanks for bringing this up, @Micro_Biologist!): OTU clustering the denoised SVs (with q2-vsearch) and analyzing them in the same way could indicate whether sequence errors masquerading as OTUs are responsible, or whether OTU clustering "blurs" the data as we discussed. It is probably a mixture of both.
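As a rough illustration of that "blurring", here is a toy sketch of what collapsing SVs into OTUs does to the distance between two technical reps. The counts and the SV-to-OTU mapping are invented, and the mapping is assigned by hand; in a real analysis the clusters would come from q2-vsearch at whatever percent identity you choose.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as {feature: count}."""
    features = set(a) | set(b)
    num = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    den = sum(a.get(f, 0) + b.get(f, 0) for f in features)
    return num / den


def collapse(sample, sv_to_otu):
    """Sum SV counts within each OTU (hand-made stand-in for real clustering)."""
    otus = {}
    for sv, count in sample.items():
        otu = sv_to_otu[sv]
        otus[otu] = otus.get(otu, 0) + count
    return otus


# Hypothetical technical reps: same community, different "strains" detected.
rep1 = {"sv1": 10, "sv3": 5}
rep2 = {"sv2": 10, "sv3": 5}

# Assumed clustering result: sv1 and sv2 fall into the same OTU.
SV_TO_OTU = {"sv1": "otu_a", "sv2": "otu_a", "sv3": "otu_b"}

sv_distance = bray_curtis(rep1, rep2)
otu_distance = bray_curtis(collapse(rep1, SV_TO_OTU), collapse(rep2, SV_TO_OTU))

print("SV-level Bray-Curtis: ", sv_distance)
print("OTU-level Bray-Curtis:", otu_distance)
```

At the SV level the two reps look quite different; once the near-identical SVs are merged, the difference between them disappears. Whether that is noise removal or lost resolution is exactly the question the comparison is meant to answer.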
I will let the UniFrac developers defend their own method, but the main reason (in my mind) is in part what we are seeing here: on the one hand, accounting for phylogenetic distance between features will minimize the effect of "strain"-level differences (so I suspect you would see a smaller effect in your data if you used UniFrac); on the other hand, it will magnify the effect of larger differences, helping differentiate other samples.
In many cases accounting for phylogenetic distance may be related to an assumption that features that are more similar phylogenetically will also be more similar functionally — but either way it can be a useful method for teasing apart some sample types.
At the end of the day it's just another tool in our toolbox, and using different distance metrics to make informed decisions/comparisons can be very useful for inferring biological differences (just as comparing SV/OTU results can be very informative). It could be worth using UniFrac here to support this inference (or completely refute everything I've just written, which is okay too!)
Thanks!