OTU clustering and table filtering

Hi

I have a few basic questions:

  1. What is the difference between OTUs and ASVs?
  2. What is the threshold value for OTU clustering in QIIME 2? Is it 99% or 97%?
  3. How do we decide the --p-min-frequency value when filtering the OTU table with the "qiime feature-table filter-features" command?

Best
Nisha

I'll take a swing at your questions, though it's worth pointing out that at least some are Googleable.

  1. OTUs (Operational Taxonomic Units) are clustered sequences; ASVs (Amplicon Sequence Variants) maintain single-nucleotide resolution rather than agglomerating/clustering sequences. See e.g. "Exact sequence variants should replace operational taxonomic units in marker-gene data analysis" (The ISME Journal).

  2. The denoising tools currently in QIIME 2 (Deblur and DADA2) produce ASVs, I'm pretty sure, so your sequences aren't clustered unless you cluster them yourself. I'm no expert on this one, though.

  3. The "Amplicon SOP v2 (qiime2 2020.8)" page of the LangilleLab/microbiome_helper wiki on GitHub suggests finding the mean sample depth, multiplying it by 0.001, and using the product as your minimum frequency. But doubtless there are other approaches from more experienced minds.

I hope this helped, Nisha.

Good answers, @wburgess! Just to add to your answer to question 2:

@wburgess is right that denoising methods are recommended now, and a couple are integrated in QIIME 2... but OTU clustering is still possible using the q2-vsearch plugin (see documentation and tutorials at qiime2.org). The clustering threshold can be whatever you want it to be — see the documentation for usage details.
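To make the effect of the threshold concrete, here is a toy sketch in Python. It is not how q2-vsearch works internally: real tools compute identity from sequence alignments, whereas this example assumes equal-length sequences and simply counts matching positions. The functions `identity`, `greedy_cluster`, and `mutate`, and the sequences themselves, are made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence in the first cluster whose centroid it matches at
    >= threshold identity; otherwise it becomes a new centroid."""
    clusters = {}  # centroid sequence -> list of member sequences
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


def mutate(seq, positions):
    """Return a copy of seq with one substitution at each listed position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


# Hypothetical 200-nt sequences: a base sequence plus variants with
# 1, 4 and 10 substitutions relative to it.
base = "ACGT" * 50
seqs = [
    base,
    mutate(base, [10]),
    mutate(base, [10, 50, 90, 130]),
    mutate(base, range(10, 200, 20)),
]

for threshold in (0.97, 0.99, 1.0):
    n_clusters = len(greedy_cluster(seqs, threshold))
    print(f"threshold {threshold}: {n_clusters} cluster(s)")
```

With these four sequences, a looser threshold merges more variants into one cluster, while a threshold of 1.0 keeps every distinct sequence separate, which is roughly the ASV-like end of the spectrum.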

I hope that helps, @Nisha!

Thanks for your quick response.

For question 3, does "mean depth" refer to the "Mean frequency per sample"? And why do we multiply it by 0.001 specifically?
Best
Nisha

I have used the SILVA database to assign taxonomy to my sequences. Apart from many unassigned species, the results also include unusual classifications at the species level, e.g. "Prevotella sp.", "Uncultured Bacteroidales", "Human gut", "Gut metagenome", etc.

I can't work out how to interpret these results.

If you follow the link I gave with that answer, you'll find a rationale: "One possible choice would be to remove all ASVs that have a frequency of less than 0.1% of the mean sample depth. This cut-off excludes ASVs that are likely due to MiSeq bleed-through between runs (reported by Illumina to be 0.1% of reads)."

Depth and frequency here do, a little counterintuitively I think, mean the same thing. But depth varies from sample to sample; the advice I got myself and gave to you was to use 0.1% of the mean. Consider the following screenshot:

The mean frequency/depth is 52428, and 0.1% of that is about 52.4. So my default would be cutting below a minimum of 52.

[Edit: to be absolutely clear, this screenshot is an example from a run of mine. 52 is the cutoff for that run, based on my data, not a default minimum for others. One needs to use the mean for one's own run.]
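The arithmetic can be sketched in a few lines of Python. This is only an illustration of the rule of thumb quoted above, not part of QIIME 2: the function name `min_frequency_from_depths` and the per-sample depths are hypothetical (chosen so that their mean matches the 52428 from my run), and truncating to a whole number is my own choice.

```python
from statistics import mean

# 0.1% of reads, the bleed-through rate quoted in the SOP rationale above.
BLEED_THROUGH_FRACTION = 0.001


def min_frequency_from_depths(sample_depths, fraction=BLEED_THROUGH_FRACTION):
    """Return `fraction` of the mean per-sample depth, truncated to an integer."""
    return int(mean(sample_depths) * fraction)


# Hypothetical per-sample read counts; use the values from your own run.
sample_depths = [40000, 52428, 64856]

cutoff = min_frequency_from_depths(sample_depths)
print(f"mean depth: {mean(sample_depths)}, minimum frequency: {cutoff}")
```

The resulting integer is what you would then pass as the minimum frequency when filtering.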

In my dataset, this value is greater than the median. So won't I be losing many sequences?

Are you saying that 0.1% of the mean is greater than the median? Or are you saying that the mean is greater than the median? If (mean * 0.001) > median, then your data has bigger problems to address. If mean > median, I think you can understand the basic statistics yourself.

If you're worried, I suggest that you run the code with various choices of minimum frequency, and see for yourself how many features you lose.
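Here is a rough sketch of that kind of comparison in plain Python, outside QIIME 2. It assumes that a minimum-frequency filter keeps a feature when its total count across all samples is at least the cutoff. The feature names, counts, and the helper `features_retained` are invented for illustration; in practice you would use the totals from your own feature table.

```python
def features_retained(feature_totals, min_frequency):
    """Keep features whose total frequency is at least min_frequency."""
    return {
        feature: total
        for feature, total in feature_totals.items()
        if total >= min_frequency
    }


# Hypothetical total frequency of each feature, summed over all samples.
feature_totals = {
    "feat_a": 12000,
    "feat_b": 480,
    "feat_c": 52,
    "feat_d": 51,
    "feat_e": 7,
    "feat_f": 1,
}

total_reads = sum(feature_totals.values())
for cutoff in (1, 10, 52, 500):
    kept = features_retained(feature_totals, cutoff)
    share = sum(kept.values()) / total_reads
    print(f"min frequency {cutoff}: {len(kept)} features kept, "
          f"{share:.1%} of reads kept")
```

Looking at the share of reads as well as the number of features can be reassuring: a cutoff may drop many rare features while removing only a small fraction of the reads.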

Sorry, it is (mean * 0.001) > median.
If I take the median value as p-min-frequency, I get 3429 features, and with "mean * 0.001" I'm left with 2981 features.
