Closed Reference OTU picking vs taxonomic annotation

osman · February 3, 2020, 10:27pm

Hi, I have no background in these analyses, but I would like to clarify the following.

During OTU clustering, one of the methods is Closed reference.

How is it different from regular taxonomic annotation, other than the disadvantage of not being able to identify taxa that are missing from the reference? Isn't it exactly doing the annotation on all sequences?

Thanks in advance.

jwdebelius · February 4, 2020, 8:58am

Hi @osman,

Welcome to the QIIME 2 forum!

So, let's step back a minute and talk about what happens when we do sequencing. We essentially start with a ground truth sequence. That then goes through PCR and sequencing, which add some noise and some error. There are a bunch of ways we can approach this. We could blast against a reference database (time consuming!) or we could cluster and hope that the reads that clustered together were from the same original sequence.

In the beginning (ancient times, like 2005-2010ish), the best way to do this was de novo OTU picking, which took the sequences, compared them to each other, and spit out read clusters. Then, you did taxonomic classification with a naive Bayes classifier or by blasting the centroids or something, and you got a phylogenetic tree by doing de novo alignment on your centroids. The advantage was that you got to keep all your reads (whoo!). The disadvantage was that the clusters weren't externally valid or necessarily stable. Introducing a new sequence could shift everything!

And so, one solution to this was the move toward closed reference picking against databases (2010-2015ish), where the OTUs were clustered against an existing external framework. The advantage is that no matter how many more sequences you have, they should (theoretically) behave the same way. (YMMV). The tree and taxonomy then came from the database. The other advantages are that everything is consistent within a database and hypervariable region (whoo!) and you can combine sequences from multiple regions against the same reference (whoo!). The disadvantage is that if a database wasn't well defined for your environment, you would lose sequences. Also, your results were only as good as your reference, so you have to trust your reference. IMO, closed reference works really well for defined human communities and is problematic elsewhere.
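To make the closed reference idea concrete, here is a toy sketch in Python. It is not how any real tool is implemented: real pipelines use proper sequence alignment, while this assumes equal-length, pre-aligned strings and a simple position-by-position identity. The reference IDs, sequences, and taxonomy labels are made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions (toy: sequences must be the same length)."""
    if len(a) != len(b):
        raise ValueError("toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, reference, threshold=0.97):
    """Count reads per reference ID; reads with no hit at the threshold are set aside."""
    table = {}
    unmatched = []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            table[best_id] = table.get(best_id, 0) + 1
        else:
            unmatched.append(read)  # closed reference throws these away
    return table, unmatched


# Hypothetical reference database: sequences plus a taxonomy lookup by ID.
reference = {"ref_A": "ACGT" * 10, "ref_B": "TTGCA" * 8}
taxonomy = {"ref_A": "g__HypotheticalA", "ref_B": "g__HypotheticalB"}

reads = [
    "ACGT" * 10,               # exact match to ref_A
    "T" + ("ACGT" * 10)[1:],   # one mismatch in 40 bases (97.5% identity)
    "TTGCA" * 8,               # exact match to ref_B
    "G" * 40,                  # nothing like it in the reference
]

table, unmatched = closed_reference(reads, reference)
# Taxonomy is just a lookup on the reference ID; nothing is classified per read.
annotated = {ref_id: (count, taxonomy[ref_id]) for ref_id, count in table.items()}
print(annotated)
print(len(unmatched), "read(s) lost")
```

The point for the original question: in closed reference, the sequence is replaced by the reference ID it hit, and the taxonomy (and tree) come along with that ID. With a classifier, you keep your own sequences and only attach a label to them.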

Open reference was the answer to the problem of a poorly defined environment: anything you could cluster against the reference got that identifier and taxonomy, and then what couldn't be clustered was scooped up, clustered on its own, and annotated. It's faster and more externally valid than de novo but keeps more reads in new environments than closed reference.
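As a rough sketch of that two-step logic (again a toy, assuming equal-length pre-aligned strings and a simple greedy clustering for the leftovers; the `denovo_N` naming and all sequences are hypothetical):

```python
def identity(a, b):
    """Fraction of matching positions (toy: sequences must be the same length)."""
    if len(a) != len(b):
        raise ValueError("toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_hit(read, candidates, threshold):
    """Return the ID of the best candidate at or above the threshold, else None."""
    best_id, best_score = None, threshold
    for cand_id, cand_seq in candidates.items():
        score = identity(read, cand_seq)
        if score >= best_score:
            best_id, best_score = cand_id, score
    return best_id


def open_reference(reads, reference, threshold=0.97):
    """Try the reference first; cluster whatever is left over de novo."""
    counts = {}
    new_centroids = {}  # de novo clusters seeded by reads the reference missed
    for read in reads:
        hit = best_hit(read, reference, threshold)
        if hit is None:
            hit = best_hit(read, new_centroids, threshold)
        if hit is None:
            hit = f"denovo_{len(new_centroids)}"
            new_centroids[hit] = read  # this read becomes a new centroid
        counts[hit] = counts.get(hit, 0) + 1
    return counts, new_centroids


reference = {"ref_A": "ACGT" * 10}  # hypothetical one-sequence database
reads = ["ACGT" * 10, "G" * 40, "G" * 39 + "A", "C" * 40]

counts, new_centroids = open_reference(reads, reference)
print(counts)
```

Reads that hit the reference get the stable reference ID and its taxonomy. The `denovo_` clusters keep the reads but still need a classifier for taxonomy, and their IDs depend on the dataset and read order.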

The problem with all the OTU picking methods is that they cluster at some identity threshold, which means you (potentially) lose out on real biological variation. If you have two organisms which only differ by a single base pair in their 16S sequence over the region where you're looking, you probably won't see them as separate at the standard 97% clustering.
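A quick bit of arithmetic shows why. The 250 bp amplicon length below is just an assumed round number for illustration, not something specific to any dataset:

```python
def percent_identity(length, mismatches):
    """Percent identity between two aligned sequences of the given length."""
    return 100.0 * (length - mismatches) / length


def max_mismatches_within(length, threshold_percent=97.0):
    """Largest number of mismatches that still meets the identity threshold."""
    m = 0
    while m < length and percent_identity(length, m + 1) >= threshold_percent:
        m += 1
    return m


amplicon_length = 250  # assumed read length, for illustration only

one_snp = percent_identity(amplicon_length, 1)
tolerated = max_mismatches_within(amplicon_length, 97.0)

print(f"1 mismatch in {amplicon_length} bp = {one_snp:.1f}% identity")
print(f"up to {tolerated} mismatches still cluster together at 97%")
```

So two sequences one base apart sit at 99.6% identity, well inside a 97% cluster, and under these assumptions sequences can differ at up to 7 positions and still be collapsed together.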

And then, in 2016, denoising was introduced. (There was a group who was doing oligotyping in like 2015ish, but we don't really talk about that anymore so we're going to ignore it here). Denoising is this idea that we know what error looks like... so we just remove the error! There are a couple of different algorithmic approaches to this problem (two of which are implemented in QIIME 2) but that's essentially it. And then, you get your taxonomy by blasting or clustering or a naive Bayes classifier or an improved naive Bayesian classifier... (There have been a lot of developments in 16S in the last like 4ish years).
...You can also denoise to ASVs and then cluster again!

Okay, so given the slight history lesson, here's my summary table:

criteria	de novo	open reference	closed reference	denoising
requires database	No	Yes	Yes	No
keeps all your high quality reads	Yes	Yes	No	Yes
externally valid	No	Kind of	Yes	With same trim length and hypervariable region
combine multiple hypervariable regions	No	No	Yes	No
taxonomic annotation	Classifier	Database & classifier	Database	Classifier
single nucleotide resolution	No	No	No	Yes

Ultimately, my (personal) recommendation is to just go for ASVs. It's 2020, a new decade, and there's not a lot of reason to throw out resolution when we have the technology to look at it!

But, there was also a great thread debating these issues a while back:

Best,
Justine

osman · February 4, 2020, 12:38pm

Thank you Justine for your time and effort for the detailed explanation.