De novo VS Closed Ref

Good evening to everyone,

I am sorry if this question is “too basic”, but I have a doubt about the two approaches to OTU picking: de novo and closed reference.
I understand more or less how they work, but I don’t understand why I should use one instead of the other. What are the benefits and limitations? And why does everyone always use de novo, while closed reference is used only for PICRUSt?
Thanks in advance to everyone who will help me understand this.


Hi @michele_quail ,

This can be a little bit of a controversial topic in some circles. It, in part, depends on what type of sample you’re working with and what your options are. My work has focused primarily on western humans, and so my views reflect that bias. Other people, who work in other environments, may have a different perspective.

I want to start with the one case where you absolutely must use closed-reference picking, and then work toward less strict scenarios. The place where you absolutely must use a closed-reference method is a meta-analysis spanning multiple hypervariable regions. De novo clustering or denoising of different regions will conclude that there are differences based on the region alone. (It’s worth noting that the hypervariable region has a strong signal in general, regardless of clustering/denoising method.)

It’s my opinion (even if it’s not generally accepted) that you should use closed-reference picking or denoising for well-defined environments with good references (e.g. the human body, model organisms). This is because if we both run a closed-reference method against the same database, my OTU 737 and your OTU 737 should be the same OTU, which is really helpful, both biologically and from the perspective of a simple textual “am I seeing the same thing?”.
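To make the “my OTU 737 is your OTU 737” point concrete, here is a toy Python sketch of closed-reference picking. The mini reference database, the identity threshold, and the OTU IDs are all made up for illustration, and a real implementation (e.g. VSEARCH) works on sequence alignments rather than `difflib` ratios, but the key property is the same: the label comes from the reference, not from your particular dataset.

```python
from difflib import SequenceMatcher

# Hypothetical mini reference database: OTU IDs are stable properties
# of the reference, not of any one dataset.
REFERENCE = {
    "OTU_737": "ACGTACGTACGTACGT",
    "OTU_102": "TTGGCCAATTGGCCAA",
}

def closed_ref_pick(seq, threshold=0.97):
    """Assign seq to the most similar reference OTU, or return None if
    nothing in the reference matches above the identity threshold (the
    sequence is discarded -- the cost of closed-reference picking)."""
    best_id, best_sim = None, 0.0
    for otu_id, ref_seq in REFERENCE.items():
        sim = SequenceMatcher(None, seq, ref_seq).ratio()
        if sim > best_sim:
            best_id, best_sim = otu_id, sim
    return best_id if best_sim >= threshold else None

# Two labs processing different datasets still agree on the label:
print(closed_ref_pick("ACGTACGTACGTACGT"))  # OTU_737 in my study
print(closed_ref_pick("ACGTACGTACGTACGT"))  # OTU_737 in yours too
print(closed_ref_pick("GGGGGGGGGGGGGGGG"))  # None -> sequence dropped
```

The last line also previews the main criticism discussed below: anything the reference doesn’t cover is simply thrown away.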

Closed reference picking also, as you mentioned, allows you to make use of PICRUSt and other OTU-based algorithms.

I think the major limitation of closed-reference picking is the level of trust you place in your reference database, and which reference you use. Essentially, your OTU picking and assignment become only as good as the reference you select. The annotation aspect is true of other algorithms too (taxonomic assignment with a naive Bayesian classifier also requires a good training set), but I think it’s less prominent in those cases.

One of the major complaints against closed reference picking is that you sacrifice sequences. This is a fair criticism, I think, in environments that are not well characterised. So, if you’re working with an environmental sample from a new extreme environment or an environment which just isn’t covered by one of the databases, you may give up half your sequences to closed reference picking. As a result, a lot of people fall into the camp of “cluster everything and then filter”.
This may also reflect some of the historical issues in the field: de novo algorithms were some of the first released, meaning they tend to have more citations. And there are certain camps who do not use closed reference at all.

A third approach, which I tend to see more in the literature than de novo, is open reference, which is a combination of the two. It retains the advantage of closed-reference picking, in that you get the benefits of a reference and therefore more comparable data, but it also lets you retain more sequences.
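A toy sketch of that combination, under the same made-up reference and threshold assumptions as before (real open-reference picking in QIIME uses VSEARCH/UCLUST, not this greedy `difflib` loop): sequences are matched against the reference first, and only the leftovers are clustered de novo, so nothing is discarded.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def open_ref_pick(seqs, reference, threshold=0.97):
    """Toy open-reference picking: try the reference first; sequences
    with no hit are greedily clustered de novo so nothing is thrown
    away. Note the new-OTU IDs are dataset-specific, unlike the
    reference IDs, so only the reference hits are comparable across
    studies."""
    assignments, leftovers = {}, []
    for seq in seqs:
        hit = max(reference, key=lambda oid: similarity(seq, reference[oid]))
        if similarity(seq, reference[hit]) >= threshold:
            assignments[seq] = hit
        else:
            leftovers.append(seq)
    # Greedy de novo clustering of the unmatched sequences.
    centroids = []
    for seq in leftovers:
        for i, c in enumerate(centroids):
            if similarity(seq, c) >= threshold:
                assignments[seq] = f"New_OTU_{i}"
                break
        else:
            centroids.append(seq)
            assignments[seq] = f"New_OTU_{len(centroids) - 1}"
    return assignments

ref = {"OTU_737": "ACGTACGTACGTACGT"}
reads = ["ACGTACGTACGTACGT",   # hits the reference -> OTU_737
         "GGGGGGGGGGGGGGGG",   # no hit -> founds New_OTU_0
         "GGGGGGGGGGGGGGGT"]   # near the new centroid -> New_OTU_0
print(open_ref_pick(reads, ref, threshold=0.9))
```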

Finally, I want to mention denoising as a category unto itself. Because denoising methods (Deblur and DADA2 are both implemented in QIIME 2) are per-sequence algorithms rather than clustering-based algorithms, they give you some of the features of both closed-reference and de novo picking. Namely, because your identifier is an individual sequence rather than a cluster, it can easily be compared across multiple datasets (as long as they cover the same sequence length and region), while still letting you keep as many features as possible.
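The “identifier is the sequence itself” idea can be shown in a couple of lines. QIIME 2 derives ASV feature IDs by hashing the exact sequence (MD5), so no reference and no clustering step is involved, yet two independent studies that recover the same denoised sequence get the same ID. This is just a sketch of that naming convention, not of the denoising itself:

```python
import hashlib

def feature_id(seq):
    """Feature ID as the MD5 digest of the exact denoised sequence
    (the convention QIIME 2 uses for ASVs). The ID depends only on
    the sequence, not on what else is in the dataset."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()

# The same denoised sequence in two independent studies gets the same
# feature ID, as long as both trimmed to the same region and length:
study_a = feature_id("ACGTACGTACGTACGT")
study_b = feature_id("ACGTACGTACGTACGT")
print(study_a == study_b)  # True
```

Contrast this with de novo cluster IDs, which are arbitrary labels that only mean something within a single run.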

tl;dr: Closed reference is the only option for combining hypervariable regions; use closed reference for well-defined environments; use de novo or open reference for poorly defined ones; but really, use denoising whenever you can get away with it.



Thanks a lot for your explanation, and for the article as well. It was very useful for me.
I didn’t know about this last approach you mention here, the one with DADA2.
I will look for articles on it.
Thanks a lot

Just a further clarification: if I do de novo clustering and afterwards match against a reference database and remove the NAs, is that the same as taking a closed-reference approach?

No, I don’t think so, and I’m not sure what the motivation would be to do that with OTUs. I could maybe see a use for closed-reference clustering of denoised sequences, which was done in the American Gut manuscript to address some methodological issues. But if you’re dealing with de novo OTUs and you have the raw data, why not just re-pick so you have consistent results?

