To cluster or not to cluster?

Mehrbod_Estaki · May 30, 2019, 8:10pm

Hey all,
I'm always intrigued by this topic so I just thought I'd throw in my 2 cents as well; if nothing else, at least I'll have somewhere to reference in future posts. But I look forward to reading other views on the matter.
First, just to clarify terminology, because using the wrong terms can add extra confusion, especially for newcomers. The products of denoising methods such as DADA2, Deblur, UNOISE (and perhaps MED?) are Exact Sequence Variants (ESVs). These methods each arrive at their variants slightly differently, and so the variants go by different names: DADA2 ESVs are called Amplicon Sequence Variants (ASVs), Deblur calls them sub-OTUs (sOTUs), and UNOISE calls them zero-radius OTUs (zOTUs). It used to bother me that we didn't call them all the same thing, but the fact that I can figure out which method produced them simply from their correct name has grown on me. You also hear the term "feature" thrown around to refer to these, but that's a different discussion. The common idea behind all of them is that they deal with the exact sequences themselves at all times and don't worry about similarity, lineage, etc. As @jwdebelius nicely described, even a single nt difference between them, be it an addition, deletion, or substitution, will lead to calling a new ESV. Since we are discussing DADA2, I'll stick with ASV.

Now, what we do with the ASVs (and everyone agrees, we should all start with ASVs) from here on is completely question/data driven. For example: collapsing the ASVs to their closest taxonomy, functional clades, phylogenetic clades, etc., and of course clustering down to OTUs at some similarity threshold. These are all just methods we can choose to find patterns and make our data more human-digestible, and they all have their advantages and disadvantages. Take OTU clustering, for example. A disadvantage, as @jwdebelius described, is that you may lose interesting trends when you collapse ASVs. But one could argue that an advantage of clustering is that you don't falsely split the same organism into multiple ones, thus inflating its influence on community metrics. This can happen because some organisms have multiple 16S gene copies, and those copies can differ from each other by 1 or more nts. Even messier is that these differences can show up in some hypervariable regions but not others. The copy number problem is even crazier in fungi, where those differences can reach dozens or more. Your choice in all of this must then be question-driven, taking into account the advantages and disadvantages of each method you choose.
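To make the clustering trade-off concrete, here is a minimal toy sketch in Python. It is not how any real tool clusters (vsearch and friends do proper alignment); it assumes equal-length, already-aligned sequences and a simple greedy, abundance-sorted centroid approach. The sequences, counts, and function names (identity, greedy_cluster, mutate) are all made up for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions. Toy assumption: equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity only handles equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(asv_counts: dict, threshold: float = 0.97) -> dict:
    """Greedy centroid clustering: the most abundant ASVs seed OTUs first."""
    otus = {}  # centroid sequence -> summed read count
    for seq, count in sorted(asv_counts.items(), key=lambda kv: -kv[1]):
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += count  # absorb this ASV into an existing OTU
                break
        else:
            otus[seq] = count  # no centroid was close enough: start a new OTU
    return otus


def mutate(seq: str, changes: dict) -> str:
    """Return a copy of seq with the given {position: nucleotide} substitutions."""
    chars = list(seq)
    for pos, nt in changes.items():
        chars[pos] = nt
    return "".join(chars)


# Hypothetical 40-nt ASVs with made-up read counts.
base = "ACGT" * 10
copy_variant = mutate(base, {5: "A"})               # 1 mismatch: 97.5% identity
distant = mutate(base, {5: "A", 10: "T", 20: "C"})  # 3 mismatches: 92.5% identity

asv_counts = {base: 100, copy_variant: 40, distant: 10}
otus = greedy_cluster(asv_counts, threshold=0.97)

print(len(asv_counts), "ASVs ->", len(otus), "OTUs")
```

Here the 1-nt variant (which could be a second 16S copy of the same organism, or a genuinely different strain) gets folded into the dominant sequence, while the more distant one stays separate. Whether that merge is a fix or a loss of information is exactly the question-driven choice described above.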

Coming back to some of the unanswered questions now. Truncation (--p-trunc-len) is not necessary for DADA2 but is required for Deblur, and the recommendation to truncate to equal length has more to do with quality control, the denoising algorithm's specific requirements, and chimera detection than with biological importance. Of note, truncating is not recommended for ITS data, however, due to the large natural length variation of that region. In QIIME 2 you have the option of trimming/truncating at either the 5' or 3' end on the forward reads, the reverse reads, or both. Variable lengths can occur because of natural variation in the region, sequencing errors, and chimera formation (and probably some other causes). Variable length is also more likely to show up with longer target regions, such as those covered by paired-end reads. Trimming your reads to a fixed length is one option for dealing with the so-called "identical reads being called 2 different ASVs" issue, but so is leaving them as is. For example, consider these 2 ASVs:

 AATTGGCCAATT
GAATTGGCCAATT

These may be different strains of the same species with a 1 nt difference, different species with a 1 nt difference in this particular region (more differences might occur elsewhere), or different 16S gene copies of the same organism that happen to differ from each other by 1 nt, as in S. aureus. There is really no way to be sure, and trimming that first nt to get equal length doesn't resolve the ambiguity either; it simply covers it up and pretends we know. This is a limitation of short amplicon sequencing in general, with no real solution at the moment. However, there is some comfort in the fact that if such limitations confound your data, they should occur evenly across all your samples/groups, so at the end of the day, even if your data is not perfect, the patterns between your experimental groups should still stand out.
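As a small sketch of that thought experiment in Python, using the two sequences above: this is plain string slicing, not what DADA2 or Deblur do internally, and the helper names (trim_5prime, truncate_3prime) are hypothetical. It also assumes we trim only the longer read at the 5' end, which is the "trim that first nt" scenario rather than a real pipeline setting applied to every read.

```python
def trim_5prime(seq: str, n: int) -> str:
    """Drop the first n nucleotides (5' end)."""
    return seq[n:]


def truncate_3prime(seq: str, length: int) -> str:
    """Keep only the first `length` nucleotides (cut at the 3' end)."""
    return seq[:length]


asv_a = "AATTGGCCAATT"
asv_b = "GAATTGGCCAATT"

# Left as is: exact-sequence comparison sees two distinct ASVs.
as_is = {asv_a, asv_b}

# Trim the extra leading nt from the longer read: the two collapse into one.
trimmed = {asv_a, trim_5prime(asv_b, 1)}

# Truncate both to 12 nt from the 3' end instead: equal length, still distinct.
truncated = {truncate_3prime(asv_a, 12), truncate_3prime(asv_b, 12)}

print(len(as_is), len(trimmed), len(truncated))  # prints: 2 1 2
```

Same two reads, three different answers to "how many ASVs?", and none of them tells us which biological scenario we are actually looking at.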
I think I'll stop there.