OTU clustering and table filtering


I have a few basic questions:

  1. What is the difference between OTUs and ASVs?
  2. What is the threshold value for OTU clustering in QIIME 2: is it 99% or 97%?
  3. How do we decide the p-min-frequency value when filtering the OTU table with the "qiime feature-table filter-features" command?


I’ll take a swing at your questions, though it’s worth pointing out that at least some are Googleable.

  1. OTUs (Operational Taxonomic Units) are clustered sequences; ASVs (Amplicon Sequence Variants) maintain single-nucleotide resolution rather than agglomerating/clustering sequences. See e.g. https://www.nature.com/articles/ismej2017119.

  2. The current denoising tools in QIIME 2 (DADA2 and Deblur) produce ASVs, I’m pretty sure, so your sequences aren’t clustered unless you cluster them yourself. That's my understanding, anyway; I’m no expert on this one.

  3. Advice I got from the Amplicon SOP v2 (qiime2 2020.8) page of the LangilleLab/microbiome_helper wiki on GitHub suggests you should find the mean sample depth, multiply it by 0.001, and use the product as your minimum frequency. But doubtless there are other approaches from more experienced minds.

I hope this helped, Nisha.


Good answers, @wburgess! Just to add to your answer to question 2:

@wburgess is right that denoising methods are recommended now, and a couple are integrated in QIIME 2... but OTU clustering is still possible using the q2-vsearch plugin (see documentation and tutorials at qiime2.org). The clustering threshold can be whatever you want it to be — see the documentation for usage details.
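To make the effect of the threshold concrete, here is a toy Python sketch of greedy, centroid-based clustering. It is only an illustration and not how q2-vsearch is implemented: it assumes equal-length sequences that need no alignment, and the sequences, counts, and function names are all made up for the example.

```python
def identity(a, b):
    """Fraction of matching positions (assumes equal-length, pre-aligned sequences)."""
    if len(a) != len(b):
        raise ValueError("this toy example only handles equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(counts, threshold):
    """Assign each sequence, most abundant first, to the first centroid that is
    at least `threshold` identical; otherwise it becomes a new centroid."""
    centroids = {}
    for seq, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                centroids[centroid] += n
                break
        else:
            centroids[seq] = n
    return centroids


def mutate(seq, k):
    """Substitute the first k bases so the result differs from seq at k positions."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[base] for base in seq[:k]) + seq[k:]


# Hypothetical data: four distinct 100 bp sequences (think of them as four ASVs).
ref = "ACGT" * 25
counts = {
    ref: 1000,
    mutate(ref, 1): 50,   # 99% identical to ref
    mutate(ref, 3): 30,   # 97% identical to ref
    mutate(ref, 10): 20,  # 90% identical to ref
}

otus_97 = greedy_cluster(counts, 0.97)
otus_99 = greedy_cluster(counts, 0.99)
print(len(counts), len(otus_97), len(otus_99))  # 4 2 3
```

The same four sequence variants collapse into two OTUs at 97% and three at 99%, while their total count stays the same. In q2-vsearch the threshold is, as far as I know, set with the `--p-perc-identity` parameter of actions such as `qiime vsearch cluster-features-de-novo`; check the plugin documentation for your release.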

I hope that helps, @Nisha!


Thanks for your immediate response.

For question 3, does “mean depth” refer to the “Mean frequency per sample”? And why do we multiply it by 0.001 specifically?

I have used the SILVA database to assign taxonomy to my sequences and, apart from many unassigned species, it also gives unusual classifications at the species level, e.g. “Prevotella sp.”, “Uncultured bacteriodales”, “Human gut”, “Gut metagenome”, etc.

I couldn’t interpret these results.

If you follow the link I wrote with that answer, a rationale is given: "One possible choice would be to remove all ASVs that have a frequency of less than 0.1% of the mean sample depth. This cut-off excludes ASVs that are likely due to MiSeq bleed-through between runs (reported by Illumina to be 0.1% of reads)."

Depth and frequency here do, a little counterintuitively I think, mean the same thing. But depth varies from sample to sample; the advice I was given, and passed on to you, was to use 0.1% of the mean. Consider the following screenshot:

The mean frequency/depth is 52428. So my default would be cutting below a minimum of 52.

[Edit: to be absolutely clear, this screenshot is an example from a run of mine. 52 is the cutoff for that run, based on my data, not a default minimum for others. One needs to use the mean for one's own run.]
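The arithmetic can be sketched in a few lines of Python. The per-sample depths below are invented so that their mean matches my screenshot, rounding down to a whole number is my own assumption, and `min_frequency_cutoff` is just a name I made up; substitute the depths from your own table summary.

```python
import math


def min_frequency_cutoff(sample_depths, fraction=0.001):
    """Return `fraction` of the mean sample depth, rounded down to an integer."""
    if not sample_depths:
        raise ValueError("need at least one sample depth")
    mean_depth = sum(sample_depths) / len(sample_depths)
    return math.floor(mean_depth * fraction)


# Hypothetical per-sample depths whose mean is 52428.
example_depths = [40000, 52428, 64856]
cutoff = min_frequency_cutoff(example_depths)
print(cutoff)  # 52
```

The resulting integer is what goes into `--p-min-frequency`, for example `qiime feature-table filter-features --i-table table.qza --p-min-frequency 52 --o-filtered-table filtered-table.qza`, where the file names are placeholders.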

In my dataset, this value is more than the median value. So won't I be losing many sequences?

Are you saying that 0.1% of the mean is greater than the median? Or are you saying that the mean is greater than the median? If (mean * 0.001) > median, then your data has bigger problems to address. If mean > median, I think you can understand the basic statistics yourself.

If you’re worried, I suggest that you run the code with various choices of minimum frequency, and see for yourself how many features you lose.
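Before rerunning the filter several times, you can get a rough preview of each choice with a small Python sketch like the one below. The feature totals are hypothetical, and the sketch assumes a feature is kept when its total frequency is at least the minimum, which is how I understand `--p-min-frequency` to behave.

```python
def features_retained(feature_totals, min_frequency):
    """Count features whose total frequency is at least min_frequency."""
    return sum(1 for total in feature_totals.values() if total >= min_frequency)


# Hypothetical total frequency of each feature, summed over all samples.
feature_totals = {"f1": 3, "f2": 12, "f3": 52, "f4": 400, "f5": 9000}

for min_frequency in (1, 10, 52, 100):
    kept = features_retained(feature_totals, min_frequency)
    print(f"min frequency {min_frequency}: {kept} of {len(feature_totals)} features kept")
```

With real data you would fill `feature_totals` from your own feature table and compare the counts across the candidate cutoffs.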

Sorry, it is (mean * 0.001) > median.
If I take the median value as p-min-frequency, I get 3429 features, and with “mean * 0.001”, I’m left with 2981 features.

