The most recent denoising tools in QIIME 2 (Deblur and DADA2) produce ASVs, I'm pretty sure, so your sequences aren't clustered unless you cluster them yourself. That's what I think, anyway; I'm no expert on this one.
Good answers, @wburgess! Just to add to your answer to this question:
@wburgess is right that denoising methods are recommended now, and a couple are integrated in QIIME 2... but OTU clustering is still possible using the q2-vsearch plugin (see documentation and tutorials at qiime2.org). The clustering threshold can be whatever you want it to be — see the documentation for usage details.
I have used the SILVA database to assign taxonomy to my sequences, and apart from many unassigned species, it also gives unusual classifications at the species level, e.g. "Prevotella sp.", "Uncultured bacteriodales", "Human gut", "Gut metagenome", etc.
If you follow the link I wrote with that answer, a rationale is given: "One possible choice would be to remove all ASVs that have a frequency of less than 0.1% of the mean sample depth. This cut-off excludes ASVs that are likely due to MiSeq bleed-through between runs (reported by Illumina to be 0.1% of reads)."
Depth and frequency here do, a little counterintuitively I think, mean the same thing. But depth varies from sample to sample; the advice I was given, and passed on to you, was to use 0.1% of the mean. Consider the following screenshot:
The mean frequency/depth is 52428. 0.1% of that is about 52, so my default would be to filter out anything with a frequency below 52.
[Edit: to be absolutely clear, this screenshot is an example from a run of mine. 52 is the cutoff for that run, based on my data, not a default minimum for others. One needs to use the mean for one's own run.]
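For anyone who wants to redo that arithmetic for their own run, here is a minimal sketch in Python. The function name and the three per-sample depths are made up for illustration (chosen only so that their mean is 52428, as in my example); substitute the depths from your own feature table summary. The result is the value I would then pass as the minimum frequency (the p-min-frequency parameter) when filtering features.

```python
def min_frequency_cutoff(sample_depths, fraction=0.001):
    """Return fraction * mean sample depth, rounded down to a whole count.

    fraction=0.001 corresponds to the 0.1% rule of thumb quoted above.
    """
    mean_depth = sum(sample_depths) / len(sample_depths)
    return int(mean_depth * fraction)


# Hypothetical per-sample depths (not real data); their mean is 52428.
depths = [41000, 52428, 63856]

cutoff = min_frequency_cutoff(depths)
print(cutoff)
```

Rounding down is my own choice here, not something the quoted rationale specifies; rounding to the nearest integer would be just as defensible.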
Are you saying that 0.1% of the mean is greater than the median? Or are you saying that the mean is greater than the median? If (mean * 0.001) > median, then your data has bigger problems to address. If mean > median, I think you can understand the basic statistics yourself.
If you're worried, I suggest that you run the code with various choices of minimum frequency, and see for yourself how many samples you lose.
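To get a feel for this before rerunning the whole pipeline, something like the following rough sketch can help. It assumes you have exported the total frequency of each feature (summed across samples) into a dictionary; the feature names, counts, and thresholds below are invented for illustration, and it counts features rather than samples. It also assumes a feature is kept when its total frequency is at least the threshold.

```python
def features_kept(feature_totals, min_frequency):
    """Count features whose total frequency is at least min_frequency."""
    return sum(1 for total in feature_totals.values() if total >= min_frequency)


# Hypothetical total frequencies per feature (ASV); not real data.
feature_totals = {"asv1": 12000, "asv2": 480, "asv3": 52, "asv4": 9, "asv5": 1}

# Try several candidate cutoffs and compare how much survives each one.
for threshold in (1, 10, 52, 500):
    kept = features_kept(feature_totals, threshold)
    print(f"min frequency {threshold}: {kept} of {len(feature_totals)} features kept")
```

If the number of features kept changes sharply between two neighbouring thresholds, that is a sign the choice of cutoff matters for your data and is worth a closer look.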
Sorry, it is (mean * 0.001) > median.
If I use the median value as p-min-frequency, I get 3429 features, and with mean * 0.001, I'm left with 2981 features.