Exclude seqs to attain only 95-97% confidence


I have a question. I just noticed that I missed an integral part of quality control, and I want to make sure that it in fact needs to be performed, since I have not noticed this step in any of the steps I have used.

I was writing my methodology and realized that my samples do not have a cut-off confidence level. I trained the taxonomy at 99% and initially thought this would automatically parse out those species or taxa that did not meet the cutoff value. I now realize that was not the case. I did notice that there is a --p-confidence option in classify-sklearn, but as I read its description, it is not meant to be used to limit the taxonomy.

My question is, is using qiime quality-control exclude-seqs the way of achieving this? If so, can you tell me what data goes in the following parameters?

qiime quality-control exclude-seqs
--i-query-sequences ______
--i-reference-sequences ______
--p-method blast
--p-perc-identity 0.95
--output-dir Cutoff

If this is not the method, and/or if no other step in the “moving pictures tutorial” accomplishes this, can you tell me how I can do this?



Hello again Fabiola,

I think there might be multiple ways of answering this question.

Confident of what? That the taxonomy assignment is correct, or that the reads have <1 errors, or that the rep-seqs.fna are each >3% different?

Different steps allow us to calculate confidence in different ways, so we need to pick a specific step to perform this calculation.

Does that make sense? At what part of your pipeline do you want to set a cut-off level?


Well, to be honest with you, I am not 100% sure where this step goes. I feel like it's either during taxonomy ID and/or after, to keep only those taxa identified to a specific taxonomy with a set confidence level.

Based on the options, I would say that the taxonomy is correct? Or what do you suggest?

Can you tell me the difference between the other two?

Maybe we should work backwards; when have you seen confidence reported before? Maybe we can use that as a point of reference and see if we can report a similar number.

For what it’s worth, here are two of my papers (webpage, PDF) on which I was the informatics point of contact. I never attempted to report confidence, but I do list other properties of the processing pipeline including clustering threshold, databases used, and statistical tests performed. I do my best to report the steps I performed so that others can replicate the results, but I don’t try to list every possible setting…

Then don’t report it (unless reviewer 3 asks you about it!) :wink:



I’ve seen it in a few Ectomycorrhizal papers, where they say something along the lines of, the OTU table was created using a 95% similarity or 97% etc.

For example, this one:
All analyses were based on the 95% similarity OTU table parsed for ECM fungal taxa. (I performed the EcM parsing using FUNGuild) so I don’t think she’s referring to that, as her supplementary info goes into more detail, stating

“After making the OTU table in usearch, we used the assign_taxonomy.py command in QIIME (Caporaso et al 2010) to assign taxonomy based on the same UNITE database. The resulting OTU table based on 95% similarity cut-off included 4,393,501 sequences from a total of 70 samples, and yielded 569 total OTUs.”

If you have any ideas, please let me know, in the meantime, I will look over your papers.

Okay, so I saw that in your papers you have “these sequences were clustered using 95%…”
Question: You have the Greengenes references, so in my case, can I just say that my OTU table was clustered at 99% similarity?

So sorry for all the “dumb” questions, but the more I look into this the more confused I get.

Hi @Fabs,
The % similarity you are describing is the % similarity used for clustering OTUs — e.g., to cluster your input sequences at 97% similarity to construct OTUs.

My guess is that you used deblur or dada2 to denoise your sequences — in which case you do not need to perform clustering. Your sequences are effectively 100% OTUs (that have been denoised).
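To make the difference concrete, here is a toy Python sketch of what a clustering threshold does. This is only an illustration under simplifying assumptions: the function names are made up for this example, the sequences are fake and already aligned, and real OTU pickers and denoisers (dada2, deblur) work quite differently internally.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (already aligned)")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Assign each sequence to the first cluster whose centroid it matches at >= threshold."""
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            # cluster[0] acts as the centroid (hypothetical, simplified rule)
            if percent_identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Three toy 100-bp sequences: b differs from a at 1 base, c differs from a at 10 bases.
a = "A" * 100
b = "C" + "A" * 99
c = "C" * 10 + "A" * 90

print(len(greedy_cluster([a, b, c], 0.97)))  # a and b collapse into one OTU
print(len(greedy_cluster([a, b, c], 1.00)))  # every distinct sequence stays separate
```

At a 97% threshold the one-base variant is absorbed into its neighbor, while at 100% each distinct sequence remains its own unit, which is roughly the sense in which denoised sequences behave like "100% OTUs".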

No — that is entirely separate. The reference sequences here do consist of OTUs, but they have been clustered mostly to dereplicate the database, making it easier/faster/less memory intensive to use, while still retaining enough information to, e.g., identify species. You will report something along the lines of:

“Sequences were denoised using the q2-dada2 plugin (citation) with default parameters. ASVs were classified taxonomically using the classify-sklearn method in q2-feature-classifier plugin (citation) for QIIME 2 (citation), using default parameters; the UNITE database (release number) (citation) clustered at X % similarity was used as reference sequences for taxonomy classification.”

So to clarify:

OTU clustering is NOT necessary, since you are denoising your sequences instead.

I hope that clarifies!


Thank you for the great explanation! So, to clarify and go back to the original question: I do not need to filter my taxonomy to a confidence level, correct? I just state the % similarity as you described?



The confidence levels are actually used to decide how to classify your sequences (e.g., can you confidently classify at species level? genus? etc). But that’s another story for another day, and nothing you need to report in your paper (only note the confidence setting if you manually adjusted this — otherwise just report default parameters and the QIIME 2 version #).
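If it helps to picture it, here is a rough Python sketch of the idea: the classifier's probabilities are summed per taxonomic level, and the label is truncated at the deepest level that still meets the confidence threshold. Please treat this as a simplified illustration and not the actual q2-feature-classifier code; the function name, the toy lineages, the probabilities, and the 0.7 threshold are all assumptions made up for the example.

```python
from collections import defaultdict


def truncate_by_confidence(probs: dict[str, float], confidence: float) -> tuple[str, float]:
    """Return the deepest lineage prefix whose summed probability meets the threshold.

    probs maps full semicolon-separated lineages to per-class probabilities
    (hypothetical classifier output for a single sequence).
    """
    max_depth = max(len(lineage.split(";")) for lineage in probs)
    label, label_conf = "Unassigned", 0.0
    for depth in range(1, max_depth + 1):
        # Sum probability over all lineages that share the same prefix at this depth.
        totals: dict[str, float] = defaultdict(float)
        for lineage, p in probs.items():
            prefix = ";".join(lineage.split(";")[:depth])
            totals[prefix] += p
        prefix, total = max(totals.items(), key=lambda kv: kv[1])
        if total < confidence:
            break  # not confident enough to go any deeper
        label, label_conf = prefix, total
    return label, label_conf


# Made-up probabilities for one sequence (toy taxon names, not real data).
toy_probs = {
    "k__Fungi;g__GenusA;s__species1": 0.50,
    "k__Fungi;g__GenusA;s__species2": 0.35,
    "k__Fungi;g__GenusB;s__species3": 0.15,
}

label, conf = truncate_by_confidence(toy_probs, confidence=0.7)
print(label)  # k__Fungi;g__GenusA
```

In this toy case neither species reaches the threshold on its own, but together they make the genus call confident, so the sequence is labeled to genus only. Nothing is filtered out of the table; the label just stops at a shallower level.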


Perfect, thank you!

I was worried that I would have to rerun my analysis, but you guys just made my data better.

