Bivalve gut microbiota, water and soil samples: which OTU clustering method?

Hi, my name is Kejty, and I am still a newbie in the world of bioinformatics.

I am studying gut microbiota from bivalves, along with water and soil microbiomes. My samples were sequenced on Illumina, targeting the V3–V4 region of the 16S rRNA gene.

My question is which clustering method is most suitable (de novo, open-reference, or closed-reference).

I currently use closed-reference OTU clustering because it is reproducible across studies. However, I was advised to use de novo clustering instead, because closed-reference clustering is better suited to very well-characterized environments (e.g., humans).

I read several discussions and articles regarding this topic, but I am still wondering if my choice is right.

Anyone open to discussion? :slightly_smiling_face:

Hi @Katerina_Fialova,

Many have moved to using amplicon sequence variants (ASVs) over the traditional OTU clustering approach, even for eDNA and diet surveys that use other marker genes. I suggest the following papers. That is not to say that OTUs are invalid, but you might sidestep the worry about clustering choice by using ASVs. You can also cluster your ASVs and see how things change at successive clustering levels.

-Cheers!

Hi, thank you very much for your suggestion and the articles!

I pitched using ASVs to my supervisor; however, I can only work with OTUs. So my question still stands. :slightly_smiling_face:

Is there an analytical reason why you need to work with OTUs? Knowing this would help us help you. We generally advise against using OTUs, but understand why it may be necessary. The first article I linked previously explains why more eloquently.

As I said, you can generate ASVs to clean the data, then cluster them at a variety of OTU thresholds and see which works best for your data and, especially, which range of clustering works best for your questions. Several thresholds may well give similar results: say, 97% and 98% might be similar to each other, and 99% and 100% might be similar to each other. If it turns out that 100% clustering works, then you can stick with ASVs (basically denoised sequences at 100% identity). Perhaps others on the forum have related experience and resources they'd like to share to help you make a reasonable decision. :slight_smile:
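To make the "cluster at successive thresholds" idea concrete, here is a minimal Python sketch. It is a toy illustration only, not what VSEARCH or QIIME 2 actually run: the sequences are made up, identity is a simple position-by-position comparison of equal-length strings (no alignment), and the helper names (`mutate`, `similar`, `greedy_cluster`) are hypothetical.

```python
from typing import Dict, Iterable, List

# Deterministic base substitution used to build toy variants (hypothetical).
SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate(seq: str, positions: Iterable[int]) -> str:
    """Return a copy of seq with the bases at the given positions changed."""
    chars = list(seq)
    for pos in positions:
        chars[pos] = SWAP[chars[pos]]
    return "".join(chars)


def similar(a: str, b: str, pct: int) -> bool:
    """True if two equal-length sequences match at >= pct percent of positions."""
    if len(a) != len(b):
        raise ValueError("this toy example assumes equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches * 100 >= pct * len(a)  # integer math avoids float rounding


def greedy_cluster(seqs: List[str], pct: int) -> List[str]:
    """Greedy centroid clustering: a sequence founds a new OTU only if it
    matches no existing centroid at the requested identity."""
    centroids: List[str] = []
    for seq in seqs:
        if not any(similar(seq, c, pct) for c in centroids):
            centroids.append(seq)
    return centroids


base = "ACGT" * 25  # a made-up 100 bp "ASV"
asvs = [
    base,
    mutate(base, range(0, 1)),    # 1 mismatch vs. base
    mutate(base, range(10, 12)),  # 2 mismatches vs. base
    mutate(base, range(20, 23)),  # 3 mismatches vs. base
    mutate(base, range(30, 40)),  # 10 mismatches vs. base
]

otu_counts: Dict[int, int] = {
    pct: len(greedy_cluster(asvs, pct)) for pct in (97, 98, 99, 100)
}
for pct, n in otu_counts.items():
    print(f"{pct}%: {n} OTUs")
```

In this toy set the number of OTUs grows as the threshold tightens, which is the pattern you would then compare against your downstream results at each level.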

Though, as someone who has experience in invertebrate research, I wanted to point out that the use of ASVs is quite common in bivalve research. Here are some recent examples:

-Cheers!

Well, I'm glad you are open for discussion, even if your advisor is not.

All of these are supported in QIIME 2. Have you tried running them?

de-novo:
https://amplicon-docs.qiime2.org/en/stable/references/plugins/vsearch.html#q2-action-vsearch-cluster-features-de-novo/

closed-ref:
https://amplicon-docs.qiime2.org/en/stable/references/plugins/vsearch.html#q2-action-vsearch-cluster-features-closed-reference/

open-ref:
https://amplicon-docs.qiime2.org/en/stable/references/plugins/vsearch.html#q2-action-vsearch-cluster-features-open-reference/

Let me know if you try running one of them!
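For readers weighing the three options, the following Python sketch shows, on made-up data, how they differ in what happens to sequences that have no match in the reference. It is a simplified illustration under stated assumptions, not the VSEARCH algorithm: sequences are equal-length strings compared position by position, and the reference, queries, and function names are all hypothetical.

```python
from typing import Dict, List, Tuple


def similar(a: str, b: str, pct: int) -> bool:
    """True if two equal-length sequences match at >= pct percent of positions."""
    if len(a) != len(b):
        raise ValueError("this toy example assumes equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches * 100 >= pct * len(a)


def closed_reference(
    seqs: List[str], reference: Dict[str, str], pct: int
) -> Tuple[Dict[str, str], List[str]]:
    """Assign each sequence to the first reference it matches.
    Sequences with no hit are returned separately (closed-ref discards them)."""
    assigned: Dict[str, str] = {}
    unmatched: List[str] = []
    for seq in seqs:
        hit = next(
            (rid for rid, rseq in reference.items() if similar(seq, rseq, pct)),
            None,
        )
        if hit is None:
            unmatched.append(seq)
        else:
            assigned[seq] = hit
    return assigned, unmatched


def de_novo(seqs: List[str], pct: int) -> List[str]:
    """Greedy clustering of the sequences against each other; returns centroids."""
    centroids: List[str] = []
    for seq in seqs:
        if not any(similar(seq, c, pct) for c in centroids):
            centroids.append(seq)
    return centroids


reference = {"ref_A": "ACGT" * 5}  # hypothetical 20 bp reference
queries = [
    "ACGT" * 5,              # exact match to ref_A
    "TCGT" + "ACGT" * 4,     # 1 mismatch in 20 bp (95% identity)
    "TTTTGGGGCCCCAAAATTTT",  # "novel" taxon, absent from the reference
]

assigned, unmatched = closed_reference(queries, reference, pct=95)
closed_otus = set(assigned.values())
# Open-reference: closed-reference first, then de novo on what did not match.
open_otus = closed_otus | set(de_novo(unmatched, pct=95))
de_novo_otus = de_novo(queries, pct=95)

print("closed-ref OTUs:", len(closed_otus), "| dropped:", len(unmatched))
print("open-ref OTUs:", len(open_otus))
print("de novo OTUs:", len(de_novo_otus))
```

The point of the sketch is the "dropped" count: with a poorly covered environment, closed-reference clustering can silently remove taxa that the other two approaches keep.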


As Nick mentioned, denoising has replaced clustering and it's all ASVs now.

But we still have the same question!

When we are first making the features, do we want to start with a database or not?
DADA2 makes de novo ASVs.
deblur denoise-16S makes closed-ref ASVs.

I’ll also mention that if you do not want to do any denoise-style preprocessing, then you’ll need qiime vsearch dereplicate-sequences, which will just make a very large table that can then be clustered by the above methods.
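Conceptually, dereplication just collapses identical reads into one feature and counts them per sample. The Python sketch below illustrates that idea on made-up reads; it is an assumption-laden toy (sample names and sequences are hypothetical) and does not reproduce the actual plugin's inputs or outputs.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical (sample_id, read) pairs.
reads: List[Tuple[str, str]] = [
    ("gut_1", "ACGTAC"),
    ("gut_1", "ACGTAC"),
    ("gut_1", "ACGTAA"),
    ("water_1", "ACGTAC"),
    ("water_1", "TTGACC"),
    ("soil_1", "TTGACC"),
]

# Feature table: one row per unique sequence, counts per sample.
table: Dict[str, Counter] = defaultdict(Counter)
for sample, seq in reads:
    table[seq][sample] += 1

for seq, per_sample in table.items():
    print(seq, dict(per_sample), "total:", sum(per_sample.values()))
```

Because every distinct read, including every sequencing error, becomes its own row, the table grows very quickly on real data, which is why it is described as "very large".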

Hi, I am sorry for the late response. Once again, thank you for the great response and the publication. :slightly_smiling_face:

I think the reason is that my supervisor has never worked with ASVs.

I did as you recommended: I generated ASVs with Deblur and then clustered them at 97, 98, 99, and 100% thresholds.
As expected, I got the most OTUs at the 100% threshold:

97%: 629

98%: 763

99%: 910

100%: 1,142

Then I checked how many OTUs remain if I keep only the OTUs that appear across several samples, and again, the 100% threshold has the most OTUs. So if I understand it right, there are many more rare OTUs shared across several samples, but the sparsity also increases at the 100% threshold.

After this, I ran unweighted PERMANOVA, and the results for all thresholds were very similar.

For my current work, I have to use OTUs. The 97% threshold would be easier for me to interpret; nevertheless, 99% or 100% gives finer resolution, which makes me wonder which is the best option.

What do you think?

Hi @Katerina_Fialova,

Great observation! I will say that there are often a few more data cleaning steps, like removing ASVs / OTUs with poor taxonomy, removing low-abundance features (e.g. singletons), removing ASVs that only appear in a few samples, etc. You can read about some of the examples here.
Often when I do these additional filtering steps, the ASV / OTU count drops substantially, but the read depth changes much less.

For example, let's say you perform all of the additional filtering: you might go from 1,100 OTUs down to 450 OTUs, but your total read count might only go from 1,500,000 reads to 1,488,000 reads. That is, many of the OTUs you observe at 99 - 100% clustering are in fact rare (low-count) or singleton features that account for very few reads. I've observed this in many data sets.
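The arithmetic behind that example can be sketched in a few lines of Python. The feature table here is made up, the `min_count` cutoff of 10 is an arbitrary assumption for illustration, and the helper names are hypothetical; only the last two lines use the numbers quoted above.

```python
from typing import Dict


def drop_rare(totals: Dict[str, int], min_count: int) -> Dict[str, int]:
    """Keep only features whose total read count is at least min_count."""
    return {feature: n for feature, n in totals.items() if n >= min_count}


def percent_kept(before: int, after: int) -> float:
    """Percentage retained, rounded to two decimals."""
    return round(100 * after / before, 2)


# Hypothetical per-feature read totals: a few dominant OTUs plus a rare tail.
feature_totals = {
    "otu1": 50000, "otu2": 30000, "otu3": 15000, "otu4": 4000,
    "otu5": 3, "otu6": 2, "otu7": 1, "otu8": 1, "otu9": 1, "otu10": 1,
}

kept = drop_rare(feature_totals, min_count=10)  # cutoff is an assumption
features_kept = percent_kept(len(feature_totals), len(kept))
reads_kept = percent_kept(sum(feature_totals.values()), sum(kept.values()))
print(f"toy table: {features_kept}% of OTUs kept, {reads_kept}% of reads kept")

# The numbers from the example above: 1,100 -> 450 OTUs, 1,500,000 -> 1,488,000 reads.
print("OTUs kept:", percent_kept(1100, 450), "%")
print("reads kept:", percent_kept(1_500_000, 1_488_000), "%")
```

In both the toy table and the quoted example, well over half of the OTUs disappear while more than 99% of the reads remain.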

In general, OTUs over-cluster some taxa and under-cluster others. ASVs can separate single taxa into different units (under-cluster). Different taxa evolve at different rates. Though in many cases OTU clustering is "good enough". So, I think you are fine choosing what makes sense to you given your research questions. One of the papers I linked earlier in this thread (Glassman & Martiny 2018) should help here.

Basically, it comes down to which side of the continuum you are most comfortable with, from being a "lumper" (OTU clustering) to being a "splitter" (ASVs). Which way would you rather lean, given your research questions? But 97% seems quite reasonable to me.

The 97% threshold was originally used as a proxy for within-species diversity, and it is not perfect. Actually, one of the main reasons OTU clustering was commonly used, before denoising algorithms existed, was to help remove noise. That is, many spurious reads, singletons, etc. would lump themselves into clusters with other sequences. Clustering is likely more useful for full-length (~1400-1500 base) sequences, as clustering at this cutoff level for short (250 - 450 base) reads will more commonly over-cluster compared to their full-length counterparts. That is, clustering full-length reads at 97% will often give you more OTUs than clustering the same gene sequences using a specific variable region, like V4, because many sequences may be identical within that variable region but differ over the rest of the gene.

Another option is to taxonomically classify all the reads, and collapse the ASVs / OTUs by taxonomy. Though this might result in different groupings of taxonomy-based OTUs depending on the reference database you use.
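As a rough illustration of collapsing by taxonomy, the Python sketch below sums the counts of all features that share a taxonomic label. The feature IDs, sample names, counts, and genus assignments are all made up, and a different reference database could assign different labels and therefore produce different groups.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical per-sample counts for three ASVs.
counts: Dict[str, Dict[str, int]] = {
    "asv1": {"gut_1": 10, "water_1": 0},
    "asv2": {"gut_1": 5, "water_1": 7},
    "asv3": {"gut_1": 1, "water_1": 2},
}

# Hypothetical genus-level labels from some classifier and reference database.
taxonomy: Dict[str, str] = {
    "asv1": "g__Vibrio",
    "asv2": "g__Vibrio",
    "asv3": "g__Mycoplasma",
}

# Sum counts of all features sharing the same label.
collapsed: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
for feature, per_sample in counts.items():
    label = taxonomy[feature]
    for sample, n in per_sample.items():
        collapsed[label][sample] += n

for label, per_sample in collapsed.items():
    print(label, dict(per_sample))
```

Here two ASVs merge into a single genus-level unit, so the grouping is driven by the labels rather than by a sequence identity threshold.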

Sorry for all the information; I just wanted to provide more context. I am sure others have their thoughts on the matter. But to reiterate, if you are comfortable with using 97% or 98%, go for it! It seems like your OTU clustering experiments narrowed down a useful range for you. It is also great that you tested for differences and found them to be minimal. Now you can respond to Reviewer 3. :slight_smile:

Anyway, I hope my rambling has helped. :grimacing:

Hi, I am sorry for my late response.

I tried running de novo and closed-reference clustering.

I think the de novo method is more suitable for my data, because when I ran closed-reference clustering, many important taxa were unmatched and removed.

Thank you for your tip! :slightly_smiling_face:

This is great, thank you so much for your answer. It helps me a lot to understand!

I have one additional question.

As mentioned above, DADA2 makes de novo ASVs and Deblur makes closed-ref ASVs.

First, I ran closed-reference clustering to get OTUs, and I was missing some important taxa.

When I ran de novo clustering, I didn’t have this problem.

Now I have run Deblur, and I am no longer missing taxa. Nevertheless, wouldn’t it be better to run DADA2?

After this step, I need to cluster to get OTUs.

That is a tough question to answer. I toggle between the two approaches myself. Some datasets I've analyzed work much better with deblur, others with DADA2. Generally speaking, I think it has to do with how the quality control and denoising occur.

TL;DR: See this post.

-Mike

Thank you for the post :slightly_smiling_face:

I compared my results after denoising and rarefying:

| Result | Deblur | DADA2 |
|---|---|---|
| Num samples | 69 | 88 |
| Num observations | 537 | 675 |
| Total count | 576990 | 1181051 |

DADA2 keeps more samples and reads. So I will go with DADA2.

I appreciate your help :slightly_smiling_face: :slightly_smiling_face:
