confidence filtering pre or post classifier?

Hello all!

I’m at the point in my pipeline of assigning taxonomy to my sample of ITS sequences and filtering. I used a UNITE classifier generously provided by Colin Brislawn, and would now like to filter the sequences to only those assigned k__Fungi with a confidence of at least 90%. I know how to filter based on k__Fungi, but cannot seem to find the syntax to filter based on confidence of assignment. Is there a way to do this after classifying, using "qiime taxa filter-table" or "qiime taxa filter-seqs"? Or do I need to train my own UNITE classifier and specify "--p-confidence 0.9" in the training?

If it’s helpful to know: I’m working with soil samples, so we expect to find a number of non-fungal organisms; I’m mainly working on my university’s cluster system (whose QIIME 2 installation doesn’t include RESCRIPt); and filtering to 90% confidence assignments will still keep 5,700 assigned sequences out of 8,800 (~65% of the total).

Thanks for any and all help!

Hello!

You can set the minimum confidence when running your own classification:

qiime feature-classifier classify-sklearn \
  --i-reads reads.qza \
  --i-classifier classifier.qza \
  --p-confidence 0.9 \
  ...

(There is also an equivalent / similar threshold when building the database, but it's related to something different.)

There should be a confidence column in your already made taxonomy.qza files, so filtering afterward should work. But I'm not sure how best to do that from within QIIME 2.

Here is the .tsv file inside a taxonomy.qza file, and you can see the confidence column right there!

Feature ID	Taxon	Confidence
f6a6a623176a0592f0a07f35d9755256	k__Bacteria	0.9863340470419281
5d4cda3de7e31f1970fb6fb06332dc99	k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__CFB-26; f__; g__; s__	0.8885389338468193
015f67f8c901c757442ef3fd1dd57abb	k__Bacteria; p__Chloroflexi; c__Anaerolineae	0.7179707914457395
9639a3291729a3758207b47715d9205f	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae	0.7481609882076466
c48070f3061b086b60ff32f77e0002fa	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Faecalibacterium; s__prausnitzii	0.9882515061314097
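
If you do want to filter on that column outside of QIIME 2, here is a minimal sketch in plain Python. It assumes the taxonomy has been exported to a tab-separated file with the three columns shown above (for example via "qiime tools export"). The sample rows, the feature IDs, and the function name filter_taxonomy are made up for illustration and are not part of QIIME 2.

```python
import csv
import io

# Illustrative rows only (hypothetical IDs and values); a real run would
# open the exported taxonomy.tsv instead of this in-memory string.
SAMPLE_TSV = (
    "Feature ID\tTaxon\tConfidence\n"
    "feat1\tk__Fungi; p__Ascomycota\t0.95\n"
    "feat2\tk__Fungi; p__Basidiomycota\t0.72\n"
    "feat3\tk__Bacteria\t0.99\n"
)


def filter_taxonomy(handle, min_confidence=0.9, required_prefix="k__Fungi"):
    """Keep rows whose Taxon starts with required_prefix and whose
    Confidence is at least min_confidence."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            confidence = float(row["Confidence"])
        except (TypeError, ValueError):
            continue  # skip rows without a numeric confidence value
        if row["Taxon"].startswith(required_prefix) and confidence >= min_confidence:
            kept.append(row)
    return kept


kept_rows = filter_taxonomy(io.StringIO(SAMPLE_TSV))
kept_ids = [row["Feature ID"] for row in kept_rows]
print(kept_ids)  # ['feat1']
```

The resulting list of feature IDs could then be written to a file and used to subset the table or sequences, though I have not tested that part end to end.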

I think including and using the taxonomy confidence is a great idea!

In my own work, I avoid thresholds because they are arbitrary; why not 99% confident? Why not >51% confident?

(It's a trap! There's no 'right' way to bin a continuous variable, so Reviewer 3 is going to ask for a different threshold anyway!)

Hi @kbmac ,

There is no need to filter based on confidence. The confidence scores (i.e., probability scores) are used internally by the classifier to decide whether classification at a given rank is sufficient. So if a given classification’s probability falls below the user-specified threshold (0.7 by default), classification is attempted at the next rank up.
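
To make that behaviour concrete, here is a simplified sketch of the "back off to the next rank up" idea. It is only an illustration of the logic described above, not the actual q2-feature-classifier code; the function name and the per-rank confidence values are hypothetical.

```python
def deepest_confident_rank(ranks, confidences, threshold=0.7):
    """Return (taxonomy string, confidence) truncated to the deepest rank
    whose confidence meets the threshold, or ("Unassigned", None)."""
    # Start at the deepest rank and move up one rank at a time.
    for depth in range(len(ranks), 0, -1):
        if confidences[depth - 1] >= threshold:
            return "; ".join(ranks[:depth]), confidences[depth - 1]
    return "Unassigned", None


# Hypothetical per-rank confidences for a single sequence.
ranks = ["k__Fungi", "p__Ascomycota", "c__Sordariomycetes", "o__Hypocreales"]
confidences = [0.99, 0.93, 0.81, 0.55]

print(deepest_confident_rank(ranks, confidences, threshold=0.7))
# ('k__Fungi; p__Ascomycota; c__Sordariomycetes', 0.81)
print(deepest_confident_rank(ranks, confidences, threshold=0.9))
# ('k__Fungi; p__Ascomycota', 0.93)
```

In this toy example, raising the threshold from 0.7 to 0.9 does not discard the sequence; it only shortens the assignment by one rank.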

So you should instead decide on the threshold that is suitable for your marker gene (the 0.7 default was based on benchmarking both 16S and ITS, so it should work for you), then let the classifier do the rest. I would filter any taxa that fail to classify to, e.g., class level (usually these are non-target if they cannot classify at least that far, but you should always double check). But I would not filter based on the confidence scores, as these are used to determine depth of assignment, not whether a given sequence should be retained or not. (There could be real fungal ITS seqs that just do not have a good hit in the database and hence have lower confidence, but that is not grounds to discard potentially real signal!)
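
As a rough illustration of filtering on depth of assignment rather than on confidence, the check below tests whether a taxonomy string carries a non-empty label at a chosen rank. It is a sketch under the assumption that ranks use the "k__; p__; c__" prefix style seen earlier in this thread; the function name is hypothetical, and placeholder labels that some databases use (such as "unidentified") are not handled.

```python
def classified_to_rank(taxon, rank_prefix="c__"):
    """Return True if the taxonomy string has a non-empty label at the
    rank identified by rank_prefix (for example "c__" for class)."""
    for label in taxon.split(";"):
        label = label.strip()
        # A bare prefix such as "f__" means the rank is present but empty.
        if label.startswith(rank_prefix) and len(label) > len(rank_prefix):
            return True
    return False


print(classified_to_rank("k__Bacteria"))                                   # False
print(classified_to_rank("k__Bacteria; p__Chloroflexi; c__Anaerolineae"))  # True
print(classified_to_rank(
    "k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__CFB-26; f__; g__; s__",
    rank_prefix="f__",
))                                                                         # False
```

Features that fail a check like this are the ones worth inspecting by hand before deciding to drop them.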

Thank you so much for the help! I somehow misunderstood the confidence filter to be applied at the training phase instead of the classifying stage (though the way it’s actually designed makes a LOT more sense).

Both you and Nicholas are right that a confidence threshold is ultimately arbitrary and setting it this high might (or perhaps, will likely) exclude some hits that are true fungi. But this is the starting point that my lab has decided on. Between lab members and reviewers, I’m sure I’ll eventually get to a pipeline that’s broadly acceptable.
