I am using the classifier to assign taxonomy to my nifH data (nitrogen fixation gene):
qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dada2.qza \
  --i-classifier classifier2017.qza \
  --o-classification taxonomy.qza
However, the classification largely failed: three of my samples are >90% unidentified. It has been suggested that I lower the confidence value from the default of 0.7, but only at 0.3 do I start to get the results I expected. I don't fully trust those assignments, so I read the paper:
Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin
The authors suggest using the default for 16S, but what about functional genes? Should I lower the value?
Long story short: this has not been benchmarked, so getting a real answer to this will require some validation.
Fortunately, based on the details you have given, there are a couple of good options:
Here are some data from a nifH mock community that you could use to do a quick benchmark. A single mock community might not tell you a great deal, but it would give you a sense of whether 0.3 is a reasonable confidence setting.
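One quick diagnostic while sweeping confidence settings is the fraction of features left "Unassigned" (or truncated to a shallow rank) at each value. Here is a minimal sketch, assuming you have exported your taxonomy.qza results to semicolon-delimited lineage strings; the example lineages below are hypothetical:

```python
def assigned_depth(lineage):
    """Number of named ranks in a semicolon-delimited lineage; 0 for 'Unassigned'."""
    if lineage.strip().lower() == "unassigned":
        return 0
    return len([r for r in lineage.split(";") if r.strip()])

def summarize(assignments):
    """Return (fraction unassigned, mean assigned depth) across features."""
    depths = [assigned_depth(a) for a in assignments]
    unassigned = sum(1 for d in depths if d == 0) / len(depths)
    return unassigned, sum(depths) / len(depths)

# Hypothetical classifications at one confidence setting
taxa = [
    "k__Bacteria; p__Proteobacteria; g__Azotobacter",
    "k__Bacteria",
    "Unassigned",
]
print(summarize(taxa))  # fraction unassigned, mean assigned depth
```

Running this on the output of each confidence setting gives you a rough curve of how much resolution you gain (and how much confidence you give up) as you lower the threshold.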
Basically, you would:
download the mock community data (links to the data are in the dataset metadata file at that link)
process with your analysis pipeline
evaluate classification accuracy by comparing your observed versus expected taxonomy. Observed is the output of your pipeline; the expected taxonomy is listed in the "source" directory at the link above. Note that the expected taxonomy just lists the taxonomic lineages provided by the source contributor and is probably not formatted to match the taxonomic names used in your reference database, so it may require a bit of local reformatting to make sure that the expected taxonomies actually match your reference!
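The observed-versus-expected comparison in that last step can be sketched as a per-rank agreement check. This is only an illustration, not the evaluation code from the paper; the example lineages and the simple normalization (stripping rank prefixes, lowercasing) are assumptions you would adapt to your own reference database:

```python
def normalize(lineage):
    """Split a semicolon-delimited lineage, drop rank prefixes like 'g__', lowercase."""
    ranks = [r.strip() for r in lineage.split(";")]
    return [r.split("__")[-1].lower() for r in ranks if r]

def per_rank_agreement(observed, expected):
    """Fraction of features whose observed taxon matches expected, at each rank depth."""
    depth = max(len(normalize(e)) for e in expected.values())
    agree = []
    for d in range(depth):
        hits = total = 0
        for feature, exp in expected.items():
            exp_ranks = normalize(exp)
            if d >= len(exp_ranks):
                continue  # expected lineage does not reach this rank
            total += 1
            obs_ranks = normalize(observed.get(feature, ""))
            if d < len(obs_ranks) and obs_ranks[d] == exp_ranks[d]:
                hits += 1
        agree.append(hits / total if total else 0.0)
    return agree

# Hypothetical lineages for illustration only
expected = {
    "feat1": "k__Bacteria; p__Proteobacteria; g__Azotobacter",
    "feat2": "k__Bacteria; p__Cyanobacteria; g__Trichodesmium",
}
observed = {
    "feat1": "k__Bacteria; p__Proteobacteria; g__Azotobacter",
    "feat2": "k__Bacteria; p__Cyanobacteria",  # truncated at low confidence
}
print(per_rank_agreement(observed, expected))  # [1.0, 1.0, 0.5]
```

Tracking this per-rank agreement across confidence settings (0.3, 0.5, 0.7, ...) is a simple way to see whether the extra depth you gain at 0.3 comes at the cost of wrong assignments.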
Alternatively, if your database is reasonably complete (i.e., it represents most of the diversity that you expect to find in nature), you could use RESCRIPt to test classification accuracy via simulation. See here for details: