OTU clustering and table filtering

Nisha · May 17, 2021, 4:39am

Hi

I have a few basic questions:

1. What is the difference between OTUs and ASVs?
2. What is the threshold value for OTU clustering in QIIME 2: is it 99% or 97%?
3. How do we decide the p-min-frequency value when filtering the OTU table with the "qiime feature-table filter-features" command?

Best
Nisha

wburgess · May 17, 2021, 12:51pm

I'll take a swing at your questions, though it's worth pointing out that at least some are Googleable.

1. OTUs (Operational Taxonomic Units) are clustered sequences; ASVs (Amplicon Sequence Variants) maintain single-nucleotide resolution rather than agglomerating/clustering sequences. See e.g. "Exact sequence variants should replace operational taxonomic units in marker-gene data analysis" (The ISME Journal).
2. The most recent denoising tools in QIIME 2 (Deblur and DADA2) produce ASVs, I'm pretty sure, so your sequences aren't clustered unless you cluster them yourself. At least I think so; I'm no expert on this one.
3. Advice I got from the Amplicon SOP v2 (qiime2 2020.8) on the LangilleLab/microbiome_helper wiki on GitHub suggests you should find the mean depth, multiply the mean by 0.001, and let the product be your minimum. But doubtless there are other approaches from more experienced minds.
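
To make the last point concrete, here is a minimal Python sketch of "mean depth times 0.001". The table, the sample and feature IDs, and the function names are all made up for illustration; this is not QIIME 2 code, and truncating the product to an integer is my own assumption about how to round.

```python
# Toy feature table: sample ID -> {feature ID -> read count}.
# All IDs and counts are invented for illustration.
table = {
    "sampleA": {"f1": 40000, "f2": 9000, "f3": 1000},
    "sampleB": {"f1": 30000, "f2": 19990, "f4": 10},
    "sampleC": {"f2": 69000, "f3": 990, "f4": 10},
}


def sample_depths(table):
    """Total read count (depth, i.e. frequency) of each sample."""
    return {sample: sum(counts.values()) for sample, counts in table.items()}


def min_frequency_cutoff(table, fraction=0.001):
    """Return fraction * mean sample depth, truncated to an integer."""
    depths = sample_depths(table)
    mean_depth = sum(depths.values()) / len(depths)
    return int(mean_depth * fraction)  # truncation is an assumption


depths = sample_depths(table)
cutoff = min_frequency_cutoff(table)
print(depths)
print(cutoff)
```

With real data, the mean would come from the "Mean frequency" per sample reported in the feature table summary rather than from a hand-built dictionary.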

I hope this helped, Nisha.

Nicholas_Bokulich · May 17, 2021, 1:04pm

Good answers @wburgess ! Just to add to your answer of this question:

@wburgess is right that denoising methods are recommended now, and a couple are integrated in QIIME 2... but OTU clustering is still possible using the q2-vsearch plugin (see documentation and tutorials at qiime2.org). The clustering threshold can be whatever you want it to be — see the documentation for usage details.
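
To show what the threshold changes, here is a toy Python sketch. It is not how q2-vsearch works internally: it assumes equal-length sequences, compares them position by position, and uses a simple greedy first-match rule. The names identity, greedy_cluster and mutate are hypothetical helpers. At 100% identity every distinct sequence stays separate (ASV-like); lowering the threshold to 99% or 97% merges near-identical sequences into fewer OTU-like clusters.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold.

    Returns a dict mapping centroid sequence -> list of member sequences.
    """
    clusters = {}
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No existing centroid was close enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


def mutate(seq, positions):
    """Substitute a different base at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


base = "ACGT" * 25                      # 100 nt toy sequence
seqs = [
    base,
    mutate(base, [0]),                  # 1 mismatch  -> 99% identical to base
    mutate(base, [10, 20, 30]),         # 3 mismatches -> 97% identical to base
    mutate(base, range(40, 50)),        # 10 mismatches -> 90% identical to base
]

for threshold in (1.0, 0.99, 0.97):
    print(threshold, len(greedy_cluster(seqs, threshold)))
```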

I hope that helps @Nisha !

Nisha · May 18, 2021, 4:46am

Thanks for your immediate response.

For question 3, does "mean depth" refer to the "Mean frequency per sample"? And why do we multiply it by 0.001 specifically?
Best
Nisha

Nisha · May 18, 2021, 7:37am

I have used the SILVA database to assign taxonomy to my sequences, and apart from many unassigned species, it also gives unusual classifications at the species level, e.g. "Prevotella sp.", "Uncultured bacteriodales", "Human gut", "Gut metagenome", etc.

I couldn't interpret these results.

wburgess · May 18, 2021, 11:36am

If you follow the link I included with that answer, a rationale is given: "One possible choice would be to remove all ASVs that have a frequency of less than 0.1% of the mean sample depth. This cut-off excludes ASVs that are likely due to MiSeq bleed-through between runs (reported by Illumina to be 0.1% of reads)."

Depth and frequency here do, a little counterintuitively I think, mean the same thing. But depth per sample varies per sample; the advice I got myself and gave to you was to use 0.1% of the mean. Consider the following screenshot:

The mean frequency/depth is 52428. So my default would be cutting below a minimum of 52.

[Edit: to be absolutely clear, this screenshot is an example from a run of mine. 52 is the cutoff for that run, based on my data, not a default minimum for others. One needs to use the mean for one's own run.]
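
For what it's worth, the arithmetic and the resulting command can be sketched in Python like this. The file names table.qza and table-filtered.qza are placeholders I made up, and truncating 52.428 down to 52 is just how I round; the script only prints the command, it does not run QIIME 2.

```python
import shlex

mean_depth = 52428                        # mean frequency per sample for my run
min_frequency = int(mean_depth * 0.001)   # 52.428, truncated to 52

# File names are placeholders, not taken from any real run.
cmd = [
    "qiime", "feature-table", "filter-features",
    "--i-table", "table.qza",
    "--p-min-frequency", str(min_frequency),
    "--o-filtered-table", "table-filtered.qza",
]
command_line = shlex.join(cmd)
print(command_line)
```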

Nisha · May 19, 2021, 9:07am

In my dataset, this value is greater than the median. Won't I be losing many sequences?

wburgess · May 19, 2021, 12:15pm

Are you saying that 0.1% of the mean is greater than the median? Or are you saying that the mean is greater than the median? If (mean * 0.001) > median, then your data has bigger problems to address. If mean > median, I think you can understand the basic statistics yourself.

If you're worried, I suggest that you run the code with various choices of minimum frequency, and see for yourself how many features you lose.
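
As a rough sketch of that comparison in Python: filter-features drops features whose total frequency is below the minimum, so you can count what survives at each candidate cutoff. The feature IDs and totals below are invented, and features_retained is a hypothetical helper, not a QIIME 2 function.

```python
# Hypothetical total frequency (summed over all samples) for each feature.
feature_totals = {
    "f1": 70000,
    "f2": 97990,
    "f3": 1990,
    "f4": 20,
    "f5": 3,
    "f6": 55,
}


def features_retained(feature_totals, min_frequency):
    """Features whose total frequency is at least min_frequency."""
    return [f for f, total in feature_totals.items() if total >= min_frequency]


for cutoff in (1, 10, 52, 100):
    kept = features_retained(feature_totals, cutoff)
    print(f"min frequency {cutoff}: {len(kept)} of {len(feature_totals)} features kept")
```

On real data you would rerun the filtering command once per cutoff and compare the feature counts in each resulting table summary.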

Nisha · May 20, 2021, 10:55am

Sorry, it is (mean * 0.001) > median.
If I take the median value as p-min-frequency, I get 3429 features; with "mean * 0.001", I'm left with 2981 features.
