I’m at the point in my pipeline of assigning taxonomy to my sample of ITS sequences and filtering. I used a UNITE classifier generously provided by Colin Brislawn, and would now like to filter the sequences to only those assigned k__Fungi with a confidence of at least 90%. I know how to filter based on k__Fungi, but cannot seem to find the syntax to filter based on confidence of assignment. Is there a way to do this after classifying using “qiime taxa filter-table” or “qiime taxa filter-seqs”? Or do I need to train my own UNITE classifier and specify “--p-confidence 0.9” in the training?
If it’s helpful to know: I’m working with soil samples, so we expect to find a number of non-fungal organisms; I’m mainly working on my university’s cluster system (which doesn’t have RESCRIPt in its QIIME 2 installation); and filtering to 90% confidence assignments will still keep 5,700 assigned sequences out of 8,800 (~65% of the total).
(There is also an equivalent / similar threshold when building the database, but it's related to something different.)
There should be a confidence column in your existing taxonomy.qza files, so filtering afterward should work. But I'm not sure how best to do that from within QIIME 2.
Here is the .tsv file inside a taxonomy.qza file, and you can see the confidence column right there!
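If you do want to filter after the fact, one option outside of QIIME 2 is to export the taxonomy and filter that table yourself. Below is a minimal sketch, assuming the exported file has the usual `Feature ID`, `Taxon`, and `Confidence` columns. The function name and the in-memory sample rows are made up for illustration. Keep in mind that the reported confidence refers to the deepest rank that was retained, not to the kingdom-level call specifically.

```python
import csv
import io

def confident_fungal_ids(tsv_text, min_confidence=0.9, kingdom="k__Fungi"):
    """Return feature IDs whose first rank is `kingdom` and whose
    Confidence value is at least `min_confidence`.

    Assumes a tab-separated table with 'Feature ID', 'Taxon' and
    'Confidence' columns (hypothetical helper, not part of QIIME 2).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    keep = []
    for row in reader:
        ranks = [r.strip() for r in row["Taxon"].split(";")]
        if ranks[0] == kingdom and float(row["Confidence"]) >= min_confidence:
            keep.append(row["Feature ID"])
    return keep

# Made-up rows standing in for an exported taxonomy.tsv
sample = (
    "Feature ID\tTaxon\tConfidence\n"
    "seq1\tk__Fungi;p__Ascomycota\t0.98\n"
    "seq2\tk__Fungi;p__Basidiomycota\t0.74\n"
    "seq3\tk__Viridiplantae\t0.99\n"
)
print(confident_fungal_ids(sample))
```

The resulting IDs could then be written to a small metadata file (with a header line) and, I believe, passed to an ID-based filter such as `qiime feature-table filter-seqs` via `--m-metadata-file`.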
There is no need to filter based on confidence. The confidence scores (i.e., probability scores) are used internally by the classifier to decide whether classification at a given rank is sufficient. So if a given classification’s probability falls below the user-specified threshold (0.7 by default), classification is attempted at the next rank up.
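To make the "next rank up" behaviour concrete, here is a simplified sketch of the idea. This is not the classifier's actual code. It assumes we already have one confidence value per rank for a sequence, and the rank names and numbers are invented for the example.

```python
def truncate_by_confidence(ranks, confidences, threshold=0.7):
    """Return the deepest lineage whose confidence meets `threshold`.

    `ranks` and `confidences` are parallel lists ordered from kingdom
    downward. Illustrative only; names and values are assumptions.
    """
    # Start at the deepest rank and walk upward until one passes.
    for depth in range(len(ranks), 0, -1):
        if confidences[depth - 1] >= threshold:
            return ";".join(ranks[:depth]), confidences[depth - 1]
    # Nothing passed, not even the top rank.
    return "Unassigned", None

ranks = ["k__Fungi", "p__Ascomycota", "c__Sordariomycetes", "o__Hypocreales"]
conf = [0.99, 0.95, 0.82, 0.55]

print(truncate_by_confidence(ranks, conf, threshold=0.7))
print(truncate_by_confidence(ranks, conf, threshold=0.9))
```

With these made-up numbers, raising the threshold from 0.7 to 0.9 does not discard the sequence; it only shortens the assignment from class level to phylum level.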
So you should instead decide on the threshold that is suitable for your marker gene (the 0.7 default was based on benchmarking both 16S and ITS, so it should work for you), then let the classifier do the rest. I would filter any taxa that fail to classify to, e.g., class level (usually these are non-target if they cannot classify at least that far, but you should always double check). But I would not filter based on the confidence scores, as these are used to determine depth of assignment, not whether a given sequence should be retained or not. (There could be real fungal ITS seqs that just do not have a good hit in the database, and hence confidence could be lower, but that is not grounds to discard potentially real signal!)
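As a rough illustration of the class-level check, the sketch below tests whether a taxonomy string reaches a given rank. It assumes UNITE-style prefixes (`k__`, `p__`, `c__`, ...) and treats an empty or `unidentified` label as not classified, which is an assumption about how your reference labels missing ranks. The function name is hypothetical.

```python
def classified_to(taxon, rank_prefix="c__"):
    """Return True if `taxon` carries a real label at `rank_prefix`.

    Assumes semicolon-separated ranks with UNITE-style prefixes.
    Empty or 'unidentified' labels count as unclassified (assumption).
    """
    for rank in taxon.split(";"):
        rank = rank.strip()
        if rank.startswith(rank_prefix):
            name = rank[len(rank_prefix):]
            return name not in ("", "unidentified")
    # The rank is missing entirely, e.g. the assignment was truncated.
    return False

examples = [
    "k__Fungi;p__Ascomycota;c__Sordariomycetes",
    "k__Fungi;p__Ascomycota",
    "k__Fungi;p__Ascomycota;c__unidentified",
    "Unassigned",
]
for taxon in examples:
    print(taxon, classified_to(taxon))
```

Anything that comes back `False` is a candidate for removal, but it is worth inspecting a few of those sequences by hand before dropping them.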
Thank you so much for the help! I somehow misunderstood the confidence filter to be at the training phase instead of the classifying stage (though the way it’s designed makes a LOT more sense).
Both you and Nicholas are right that a confidence threshold is ultimately arbitrary and setting it this high might (or perhaps, will likely) exclude some hits that are true fungi. But this is the starting point that my lab has decided on. Between lab members and reviewers, I’m sure I’ll eventually get to a pipeline that’s broadly acceptable.