Hi @DaS,
Excellent questions all around!
Unfortunately, as you may have suspected, there are no clear-cut answers to any of your questions, but rather some recommendations and guidelines. Let's jump in.
As long as you allow enough overlapping bp for proper merging, I'm all for using aggressive truncation parameters, since shorter, higher-quality reads are less error-prone. Quality over quantity. It also reduces processing time. Proper merging with dada2 requires a minimum of ~20 bp of overlap, but consider natural length variability too, so depending on your target region you may want to leave more if you can afford it.
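For illustration, here is a minimal sketch of what that looks like in QIIME 2 (file names and truncation positions are hypothetical, assuming a 2x250 bp run targeting a ~400 bp amplicon; pick yours from the quality plots):

```
# Expected overlap = trunc-len-f + trunc-len-r - amplicon length
# Here: 240 + 200 - 400 = 40 bp, comfortably above the ~20 bp minimum
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```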
I have seen median scores of 20 and 25 recommended as a 'starting point'. The higher the better, obviously, but not all datasets have the luxury of higher cutoffs, whether because of low read counts or because longer amplicons are needed for proper merging. With short, fully overlapping regions like yours this is much easier, especially since the overlap can significantly reduce errors. Reading those median scores off your own quality plot (see below) is the usual first step before committing to a cutoff.
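The interactive quality plot is the standard tool for this (`demux.qza` here is just a placeholder name for your demultiplexed reads):

```
# Per-position quality summary; open the .qzv at view.qiime2.org
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
```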
Of course we would expect changes, since we are dealing with different ASVs and lengths, and from my experience these changes are more likely to affect alpha diversity measures than beta diversity. Phylogeny-based indices are also more resistant to these parameter changes, in my experience. Could you provide a bit more detail about what is changing between your results in these scenarios? Another way to resolve parameter effects is to collapse your features to, say, the genus level and perform your analysis there (example below). It's hard to argue that one parameter set is more correct than another; rather, they are rooted in different amounts of information.
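For reference, collapsing to genus looks like this (level 6 assumes the usual 7-rank taxonomy; file names are placeholders):

```
# Collapse ASV-level table to genus (level 6 of a 7-rank taxonomy)
qiime taxa collapse \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 6 \
  --o-collapsed-table table-genus.qza
```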
Automating these processes is a very interesting and ambitious idea, something I've thought about a lot myself but haven't found a convincing way of doing yet, especially with PE data, since there are some key decision-making steps that require user input. This is easier with single-end data, or PE data with almost complete overlap, so once you figure out the nature of your sample type, perhaps you can do this with your primer sets. And a Q score of 25 is not all that low in my opinion, as it corresponds to an error probability of about 0.003 (roughly 0.3%, or ~1 error per 316 bases)! Also consider that overlapping regions can reduce this significantly via the consensus. But yes, the higher the better...
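That number is just the standard Phred definition, nothing tool-specific:

$$p = 10^{-Q/10}, \qquad Q = 25 \;\Rightarrow\; p = 10^{-2.5} \approx 0.0032 \approx 0.3\%$$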
Good question, and again this depends on what you are asking of your data. Personally, I would go for the shortest amplicon length that seems reasonable in a length-distribution plot. For example, you expect an amplicon length of 250-252, but what about one that is 240? It is likely a true feature that is naturally shorter. If we are too conservative with these lengths we introduce length bias and prevent ourselves from real discovery. Since deblur uses a positive filter anyway, I would just stick with a reasonable minimum length and not get too greedy. As for deblur, there is a secondary motive for using shorter lengths anyway (explained below).
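In QIIME 2 terms, that deblur decision is the single `--p-trim-length` everything gets cut to (the 220 bp value and file names below are hypothetical; note that deblur also expects quality-filtered input first):

```
# Deblur expects an initial quality filter
qiime quality-filter q-score \
  --i-demux demux.qza \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats filter-stats.qza

# All reads are trimmed to one length; anything shorter is dropped
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered.qza \
  --p-trim-length 220 \
  --p-sample-stats \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-stats deblur-stats.qza
```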
Unfortunately this kind of decision-making is still necessary for these analyses, and it is one of the reasons I hesitate with automation. Finding a compromise between resolution, discovery, and depth is very much dependent on your experiment's question. If you expect a large effect size between your groups, then all this will likely not matter much; if you are looking for discovery and sensitivity in your data, then some fine-tuning needs to be done.
This topic has been covered quite a few times on the forum already, so it might be worth doing some searching. Your observations are valid and on par with my experience as well. I don't think one method is superior to the other in all cases; each has its own strengths and weaknesses, but both perform very well in most cases. For a more thorough comparison, check out this recent paper that compares these methods: Denoising the denoisers. But here are a few thoughts that may help with decision-making.
As you already mentioned, deblur is very convenient for analyzing multiple projects; in fact, I believe that was one of the key factors driving its design. Dada2 can of course also be used for this purpose, but it requires that equal parameters be used across the studies, or that the final merged tables ultimately be collapsed to a common level (e.g., genus), which is less informative than ASVs.
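If you do go the equal-parameters route with dada2, the merging step itself is straightforward (artifact names here are hypothetical):

```
# Merge tables and rep seqs from studies denoised with identical parameters
qiime feature-table merge \
  --i-tables table-study1.qza table-study2.qza \
  --o-merged-table table-merged.qza

qiime feature-table merge-seqs \
  --i-data rep-seqs-study1.qza rep-seqs-study2.qza \
  --o-merged-data rep-seqs-merged.qza
```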
Dada2 can work with variable-length amplicons, which removes the trimming decision deblur requires: in your case, instead of deciding where to trim ALL your sequences for deblur, you can include them all using dada2.
One key defining difference between the two is the error model used for denoising. Dada2 trains its own model on each run, and the algorithm can be applied across sequencing platforms, whereas deblur uses a pre-packaged model specific to Illumina machines. As you can imagine, this training step adds processing time, so you are going to experience longer runs with dada2, and the gap grows as the data gets bigger.
Your comment about deblur producing fewer false positives becomes more true as amplicon lengths increase compared to dada2, but likely at the price of being too conservative. Check out this post by one of the deblur developers regarding the calculation used for deblur's expected error rate and how significantly length can affect it. With dada2 this is less of an issue, since your error model is run-specific, so it MAY be more sensitive if you have good-quality data.
Finally, if you are experiencing too many false positives from dada2 (confirm by BLAST), you could always try applying a positive filter to your feature table the way deblur does and see if that helps (sketch below).
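One way to approximate that in QIIME 2 is the quality-control plugin: split your rep seqs into reference hits and misses, then keep only the hits. The reference file and the loose identity/alignment thresholds below are illustrative, not deblur's exact internals, so tune them to taste:

```
# Split rep seqs into reference hits vs. misses (thresholds are illustrative)
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.5 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

# Keep only the features that hit the reference (a "positive filter")
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file hits.qza \
  --o-filtered-table table-filtered.qza
```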
Hope this helps a bit.
P.S. Not proofread; excuse any errors...