Hi @michele_quail ,
This can be a little bit of a controversial topic in some circles. It, in part, depends on what type of sample you’re working with and what your options are. My work has focused primarily on western humans, and so my views reflect that bias. Other people, who work in other environments, may have a different perspective.
I want to start with the one case where you absolutely must use closed reference picking, and then work toward less strict scenarios. So, the place where you absolutely must use a closed reference method is a meta-analysis combining multiple hypervariable regions. De novo clustering or denoising of different regions will conclude that there are differences based on the region, because the same organism yields different sequences in each region. (It’s worth noting that the hypervariable region has a strong signal in general, regardless of clustering/denoising method.)
It’s my opinion (even if it’s not generally accepted) that you should use closed reference picking or denoising for well defined environments with good references (i.e. the human body, model organisms, etc.). This is because if we both use a closed reference method against the same database, my OTU 737 and your OTU 737 should be the same OTU. That’s really helpful, both biologically and from the perspective of a simple textual “am I seeing the same thing” comparison.
Closed reference picking also, as you mentioned, allows you to make use of PICRUSt and other OTU-based algorithms.
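If it helps to see it concretely, this is roughly what closed reference picking looks like in QIIME 2 with the q2-vsearch plugin. The filenames and the 97% identity threshold here are placeholders for your own data and choices, not a recommendation:

```shell
# Closed reference clustering against a reference database
# (e.g. Greengenes or SILVA, imported as a FeatureData[Sequence] artifact).
# All input/output filenames are hypothetical.
qiime vsearch cluster-features-closed-reference \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-cr-97.qza \
  --o-clustered-sequences rep-seqs-cr-97.qza \
  --o-unmatched-sequences unmatched.qza
```

Anything that fails to recruit to the reference at the chosen identity ends up in the unmatched output rather than your feature table, which is exactly the sequence loss people worry about.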
I think one major limitation of closed reference picking is the level of trust you place in your reference database, and which reference you use. Essentially, your OTU picking and assignment become only as good as the reference you select. The annotation aspect is true for other algorithms as well (taxonomic assignment with a naive Bayesian classifier also requires a good training set), but I think it’s less prominent in those cases.
One of the major complaints against closed reference picking is that you sacrifice sequences. This is a fair criticism, I think, in environments that are not well characterised. So, if you’re working with an environmental sample from a new extreme environment or an environment which just isn’t covered by one of the databases, you may give up half your sequences to closed reference picking. As a result, a lot of people fall into the camp of “cluster everything and then filter”.
This may also reflect some of the historical issues in the field: de novo algorithms were some of the first released, meaning they tend to have more citations. And, there are certain camps who simply do not use closed reference.
A third approach, which I tend to see more in the literature than de novo, is open reference picking, which is a combination of the two. It retains the advantage of closed reference picking, in that you get the benefits of having a reference and therefore more comparable data, but it also allows you to retain more sequences.
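The open reference version lives in the same q2-vsearch plugin: sequences that fail to hit the reference get clustered de novo instead of being discarded. Again, filenames and the identity threshold are just placeholders:

```shell
# Open reference clustering: closed reference against the database first,
# then de novo clustering of whatever failed to match.
qiime vsearch cluster-features-open-reference \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-or-97.qza \
  --o-clustered-sequences rep-seqs-or-97.qza \
  --o-new-reference-sequences new-ref-seqs-or-97.qza
```

The new-reference-sequences output is the de novo portion; only the reference-matched OTUs are directly comparable across studies.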
Finally, I want to mention denoising as a category unto itself. Because denoising methods (Deblur and DADA2 are both implemented in QIIME 2) are per-sequence algorithms rather than clustering-based algorithms, they give you some of the features of both closed reference and de novo picking. Namely, because your identifier is an individual sequence rather than a cluster, it can be compared across multiple datasets easily (as long as they’re the same sequence length and region), but denoising also lets you keep as many features as possible.
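For completeness, here is a sketch of both denoisers in QIIME 2. The trim/truncation lengths below are illustrative only; you should pick them from your own read quality profiles, and the filenames are hypothetical:

```shell
# DADA2 on single-end reads; trunc-len should come from your quality plots.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --o-table table-dada2.qza \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-denoising-stats stats-dada2.qza

# Deblur on 16S reads (expects quality-filtered input from q2-quality-filter).
# Trimming everything to the same length is what makes the resulting
# per-sequence features comparable across studies.
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered.qza \
  --p-trim-length 150 \
  --o-table table-deblur.qza \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-stats stats-deblur.qza
```

Note that the cross-study comparability only holds if everyone trims to the same length over the same region.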
tl;dr: Closed reference only for combining hypervariable regions; closed reference with well defined environments; de novo or open reference with poorly defined ones; but really, denoising when you can get away with it.