Hi, my name is Kejty, and I am still a newbie in the world of bioinformatics.
I am studying gut microbiota from bivalves, along with water and soil microbiomes. My samples were sequenced on Illumina, targeting the V3–V4 region of the 16S rRNA gene.
I am facing a question about which cluster is the most suitable (de novo, open-reference, or closed-reference).
I work with closed-reference OTU clustering, as it is reproducible across studies. I got a suggestion to use de novo clustering instead, because closed-reference is more suitable for very well-mapped environments (e.g., humans).
I read several discussions and articles regarding this topic, but I am still wondering if my choice is right.
Many have moved to using amplicon sequence variants (ASVs) over the traditional OTU clustering approach, even for eDNA and diet surveys using other marker genes; I suggest the following papers. That is not to say that OTUs are invalid, but you might obviate the worry about clustering choice by using ASVs. You can also cluster your ASVs and see how things change at successive clustering levels.
Is there an analytical reason why you need to work with OTUs? Knowing this might help us help you. We generally advise against using OTUs, but understand why it may sometimes be necessary; the first article I linked above explains why more eloquently.
Like I said, you can generate ASVs to clean the data, then cluster them at a variety of OTU thresholds and see what works best for your data, and especially what range of clustering works best for your questions. It is likely that several thresholds will behave similarly: say, 97%–98% might provide similar results, and 99%–100% might be similar. If it turns out that 100% clustering works, then you can stick with ASVs (basically 100% denoised sequences). Perhaps others on the forum have related experience and resources they'd like to share to help you make a reasonable decision.
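To make the idea concrete, here is a toy sketch of greedy de novo clustering at successive identity thresholds. This is not how vsearch actually works, and the sequences are invented; it just illustrates why a tighter threshold yields more clusters:

```python
# Toy greedy clustering of equal-length reads at several identity thresholds.
# Sequences below are made up for illustration only.

def identity(a, b):
    # Fraction of matching positions between two equal-length sequences.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold):
    # Assign each read to the first centroid it matches at >= threshold;
    # otherwise start a new centroid (a crude stand-in for de novo OTUs).
    centroids = []
    for s in seqs:
        for c in centroids:
            if identity(s, c) >= threshold:
                break
        else:
            centroids.append(s)
    return centroids

reads = [
    "ACGTACGTAC",
    "ACGTACGTAT",  # 90% identical to the first read
    "ACGTACGAAT",  # 80% identical to the first read
    "TTTTGGGGCC",  # unrelated
]

for t in (0.80, 0.90, 1.00):
    print(t, len(greedy_cluster(reads, t)))  # cluster count grows with t
```

At 80% the three related reads lump into one cluster; at 100% every unique read is its own unit, which is essentially the ASV picture.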
Though, as someone who has experience in invertebrate research, I wanted to point out that the use of ASVs is quite common in bivalve research. Here are some recent examples:
I’ll also mention that if you do not want to do any denoise-style preprocessing, then you’ll need qiime vsearch dereplicate-sequences, which will just make a very large table that can then be clustered by the above methods.
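Conceptually, dereplication just collapses exactly identical reads and tracks their abundances before any clustering happens. A toy Python sketch (the reads are made up):

```python
# Toy sketch of exact dereplication: collapse identical reads into unique
# sequences with counts. Reads here are invented for illustration.
from collections import Counter

reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTTT"]
derep = Counter(reads)
print(derep)  # each unique sequence with its abundance
```

The resulting table of unique sequences (with counts) is what then gets fed into the clustering step.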
Hi, I am sorry for the late response. And once again, thank you for the great response and the publication.
I think the reason is that my supervisor never worked with ASVs.
I did as you recommended: ASVs with Deblur, and then clustered them at 97, 98, 99, and 100% thresholds.
As expected, I got the most OTUs for the 100% threshold:
97%: 629
98%: 763
99%: 910
100%: 1,142
I then checked how many OTUs remain if I keep only the OTUs that appear across several samples, and again the 100% threshold has the most. So if I understand it right, there are many more rare OTUs shared across several samples, but the sparsity also increased at the 100% threshold.
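That prevalence check can be sketched as follows; the feature table and the `min_samples` cutoff here are invented for illustration:

```python
# Hypothetical prevalence filter: keep only features (OTUs/ASVs) observed
# in at least `min_samples` samples. The table below is made up.

table = {
    # feature_id: per-sample counts
    "otu_a": [10, 4, 7, 0],
    "otu_b": [1, 0, 0, 0],  # appears in only a single sample
    "otu_c": [0, 2, 3, 5],
}

def prevalence_filter(table, min_samples=2):
    return {
        fid: counts
        for fid, counts in table.items()
        if sum(c > 0 for c in counts) >= min_samples
    }

filtered = prevalence_filter(table, min_samples=2)
print(sorted(filtered))  # ['otu_a', 'otu_c']
```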
After this, I ran an unweighted PERMANOVA, and the results for all thresholds are very similar.
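For anyone curious what PERMANOVA is doing under the hood, here is a minimal pure-Python sketch of the pseudo-F statistic and permutation p-value. The distance matrix and groups are invented, and in practice you would use a dedicated tool (e.g., scikit-bio or QIIME 2's beta-group-significance), not this:

```python
# Minimal PERMANOVA sketch: pseudo-F from a distance matrix plus a label
# permutation test. Data below are invented; two tight, well-separated groups.
import itertools
import random

def pseudo_f(dist, labels):
    n = len(labels)
    groups = set(labels)
    # Total sum of squares from all pairwise distances.
    ss_total = sum(dist[i][j] ** 2
                   for i, j in itertools.combinations(range(n), 2)) / n
    # Within-group sum of squares.
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in itertools.combinations(idx, 2)) / len(idx)
    ss_between = ss_total - ss_within
    a = len(groups)
    return (ss_between / (a - 1)) / (ss_within / (n - a))

def permanova(dist, labels, permutations=999, seed=0):
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, labels)
    hits = sum(pseudo_f(dist, rng.sample(labels, len(labels))) >= f_obs
               for _ in range(permutations))
    return f_obs, (hits + 1) / (permutations + 1)

# Three "water" and three "sediment" samples: small within-group distances,
# large between-group distances.
labels = ["water"] * 3 + ["sediment"] * 3
dist = [[0.0 if i == j else (0.1 if labels[i] == labels[j] else 0.9)
         for j in range(6)] for i in range(6)]

f, p = permanova(dist, labels)
print(f"pseudo-F = {f:.1f}, p = {p:.3f}")
```

Note that with only 3 + 3 samples there are few distinct label arrangements, so the smallest achievable p is around 0.1 even for perfectly separated groups; real studies need more replicates per group for small p-values.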
For my current work, I have to use OTUs. The 97% threshold would be easier for me to interpret; nevertheless, 99% or 100% gives finer resolution, which makes me wonder which is the best option.
Great observation! I will say that there are often a few more data cleaning steps, like removing ASVs/OTUs with poor taxonomy, removing low-abundance reads (e.g., singletons), removing ASVs that only appear in a few samples, etc. You can read about some examples here.
Often when I do these additional filtering steps, the ASV/OTU count drops substantially, but the read depth changes much less.
For example, let's say you perform all of the additional filtering: you might go from 1,100 OTUs down to 450 OTUs, but your total read count might only go from 1,500,000 reads to 1,488,000 reads. That is, many of the extra units you are observing at 99–100% clustering are in fact rare (low-count) or singleton features. I've observed this in many data sets.
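The arithmetic behind that example (numbers copied from above) makes the asymmetry obvious:

```python
# Filtering removes a large fraction of OTUs but a tiny fraction of reads.
# Counts are the illustrative numbers from the example above.

otus_before, otus_after = 1_100, 450
reads_before, reads_after = 1_500_000, 1_488_000

otu_loss = 1 - otus_after / otus_before
read_loss = 1 - reads_after / reads_before
print(f"OTUs removed: {otu_loss:.1%}, reads removed: {read_loss:.1%}")
# OTUs removed: 59.1%, reads removed: 0.8%
```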
In general, OTUs over-cluster some taxa and under-cluster others. ASVs can separate single taxa into different units (under-cluster). Different taxa evolve at different rates. Though in many cases OTU clustering is "good enough". So, I think you are fine choosing what makes sense to you given your research questions. One of the papers I linked earlier in this thread (Glassman & Martiny 2018) should help here.
Basically, it comes down to which side of the continuum you are most comfortable with: being a "lumper" (OTU clustering) or a "splitter" (ASVs). Which way would you rather lean, given your research questions? But 97% seems quite reasonable to me.
As 97% was originally used as a proxy for within-species diversity, it's not perfect. Actually, one of the main reasons OTU clustering was commonly used was to help remove noise prior to denoising algorithms: many spurious reads, singletons, etc. would lump themselves into clusters with other sequences. Clustering is likely more useful for full-length (~1,400–1,500 base) sequences, as clustering short (250–450 base) reads at this cutoff will more commonly over-cluster compared to their full-length counterparts. That is, clustering full-length reads at 97% will often give you more OTUs than clustering the same gene sequences using a specific variable region, like V4, since many variable regions may contain the exact same sequence in that region but differ over the rest of the gene.
Another option is to taxonomically classify all the reads, and collapse the ASVs / OTUs by taxonomy. Though this might result in different groupings of taxonomy-based OTUs depending on the reference database you use.
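A taxonomy-based collapse is conceptually just summing counts for features that share the same assigned taxonomy string. A toy sketch (the assignments and counts are invented; the grouping would of course change with a different reference database):

```python
# Hypothetical taxonomy collapse: sum feature counts that share the same
# assigned taxonomy. Assignments and counts below are made up.
from collections import defaultdict

taxonomy = {
    "asv1": "g__Vibrio",
    "asv2": "g__Vibrio",      # two ASVs with the same genus call
    "asv3": "g__Mycoplasma",
}
counts = {"asv1": 120, "asv2": 30, "asv3": 55}

collapsed = defaultdict(int)
for asv, reads in counts.items():
    collapsed[taxonomy[asv]] += reads

print(dict(collapsed))  # {'g__Vibrio': 150, 'g__Mycoplasma': 55}
```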
Sorry for all the information, I just wanted to provide more context. I am sure others have their thoughts on the matter. But to reiterate, if you are comfortable with using 97% or 98% go for it! It seems like your OTU clustering experiments narrowed down a useful range for you. It is also great that you tested for differences and found them to be minimal. Now you can respond to Reviewer 3.
That is a tough question to answer. I toggle between the two approaches myself. Some datasets I've analyzed work much better with deblur, others with DADA2. Generally speaking, I think it has to do with how the quality control and denoising occur.