confidence filtering pre or post classifier?

Hello all!

I’m at the point in my pipeline of assigning taxonomy to my sample of ITS sequences and filtering. I used a UNITE classifier generously provided by Colin Brislawn, and would now like to filter the sequences to only those assigned k__Fungi with a confidence of at least 90%. I know how to filter based on k__Fungi, but cannot find the syntax to filter based on confidence of assignment. Is there a way to do this after classifying, using “qiime taxa filter-table” or “qiime taxa filter-seqs”? Or do I need to train my own UNITE classifier and specify “--p-confidence 0.9” in the training?

If it’s helpful to know: I’m working with soil samples, so we expect to find a number of non-fungal organisms; I’m mainly working on my university’s cluster system (which doesn’t have RESCRIPt in its QIIME 2 installation); and filtering to assignments with 90% confidence will still keep 5,700 assigned sequences out of 8,800 (~65% of the total).

Thanks for any and all help!

Hello!

You can set the minimum confidence when running your own classification:

qiime feature-classifier classify-sklearn \
  --i-reads reads.qza \
  --i-classifier classifier.qza \
  --p-confidence 0.9 \
  ...

(There is also a similar threshold when building the database, but it controls something different.)

There should be a Confidence column in your existing taxonomy.qza file, so filtering afterward should work. But I'm not sure how best to do that from within QIIME 2.

Here is the .tsv file inside a taxonomy.qza file, and you can see the confidence column right there!

Feature ID	Taxon	Confidence
f6a6a623176a0592f0a07f35d9755256	k__Bacteria	0.9863340470419281
5d4cda3de7e31f1970fb6fb06332dc99	k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__CFB-26; f__; g__; s__	0.8885389338468193
015f67f8c901c757442ef3fd1dd57abb	k__Bacteria; p__Chloroflexi; c__Anaerolineae	0.7179707914457395
9639a3291729a3758207b47715d9205f	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae	0.7481609882076466
c48070f3061b086b60ff32f77e0002fa	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Faecalibacterium; s__prausnitzii	0.9882515061314097
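Since the confidence is just a column in that file, one way to filter after the fact is to work on the exported .tsv directly. Here is a minimal sketch, assuming the file has exactly three tab-separated columns (Feature ID, Taxon, Confidence) as shown above. The function name filter_taxonomy and the file names in the usage comments are made up for illustration, and this is plain awk rather than a QIIME 2 feature.

```shell
# Hypothetical helper: keep rows of an exported taxonomy.tsv whose Taxon
# starts with a given prefix and whose Confidence meets a minimum.
# Assumes three tab-separated columns: Feature ID, Taxon, Confidence.
filter_taxonomy() {
  # $1 = taxonomy TSV, $2 = taxon prefix (e.g. k__Fungi), $3 = minimum confidence
  awk -F'\t' -v prefix="$2" -v min="$3" '
    NR == 1 { print; next }   # always keep the header row
    index($2, prefix) == 1 && ($3 + 0) >= (min + 0) { print }
  ' "$1"
}

# Example usage (paths are placeholders):
#   qiime tools export --input-path taxonomy.qza --output-path exported-taxonomy
#   filter_taxonomy exported-taxonomy/taxonomy.tsv k__Fungi 0.9 > fungi-conf90.tsv
```

The resulting file still has the Feature ID column, so it could in principle serve as a list of IDs to keep when filtering the table and sequences, though I have not tested that step here.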

I think including and using the taxonomy confidence is a great idea!

In my own work, I avoid thresholds because they are arbitrary; why not 99% confident? Why not >51% confident?

(It's a trap! There's no 'right' way to bin a continuous variable, so Reviewer 3 is going to ask for a different threshold anyway!)
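If you do want to pick a cutoff, it can help to see how many features survive at several thresholds before committing to one. A rough sketch, again assuming the exported three-column taxonomy.tsv; count_by_threshold is a hypothetical helper name and the thresholds in the usage comment are only examples.

```shell
# Hypothetical helper: for each threshold, count the features whose
# Confidence (third tab-separated column) is at or above it.
count_by_threshold() {
  # $1 = taxonomy TSV; remaining arguments = thresholds to try
  file="$1"
  shift
  for t in "$@"; do
    n=$(awk -F'\t' -v min="$t" '
      NR > 1 && ($3 + 0) >= (min + 0) { c++ }   # skip the header row
      END { print c + 0 }                        # print 0 if nothing matched
    ' "$file")
    printf '%s\t%s\n' "$t" "$n"
  done
}

# Example usage (path is a placeholder):
#   count_by_threshold exported-taxonomy/taxonomy.tsv 0.51 0.7 0.9 0.99
```

Each output line is a threshold, a tab, and the number of features retained, which makes it easy to report how sensitive the results are to the choice.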