sklearn classifier, confidence and functional genes

Hello everyone,

I am using the classifier to assign taxonomy to my nifH (nitrogen fixation) data:

qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dada2.qza \
  --i-classifier classifier2017.qza \
  --o-classification taxonomy_2017.qza

However, many sequences could not be identified, and 3 of my samples are >90% unidentified. It has been suggested that I lower the confidence value below 0.7 (the default), but only at 0.3 do I start to get the results I was hoping for. I am not sure I can trust the data, so I read the paper:

Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin

The authors suggest using the default for 16S, but what about functional genes? Should I lower the value?

Thank you very much

Hi @EGvibrio,

Long story short: this has not been benchmarked, so getting a real answer to this will require some validation.

Fortunately, based on the details you have given, there are a couple of good options:

Here are some data from a nifH mock community that you could use to do a quick benchmark. A single mock community might not tell you a great deal, but it would give you a sense of whether 0.3 is a reasonable confidence setting.

Basically, you would:

  1. download the mock community data (links to the data are in the dataset metadata file at that link)
  2. process with your analysis pipeline
  3. evaluate classification accuracy by comparing your observed versus expected taxonomy. Observed is the output of your pipeline; the expected taxonomy is listed in the "source" directory at the link above. However, that file just lists the taxonomic lineages expected by the source contributor, and these are probably not formatted to match the taxonomic names used in your reference database, so you might need to reformat them locally to make sure that the expected taxonomies actually match your reference!
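
As a rough illustration of step 3, here is a minimal Python sketch of one way to compare observed and expected lineages at a chosen taxonomic depth. It is not the method used in the paper or in any QIIME 2 plugin. The function names, the semicolon-delimited lineage format, the "Unassigned" label, and the example lineages are all assumptions made for illustration, so adjust them to match your own reference database.

```python
# Sketch only: set-based precision/recall/F-measure at a given rank depth.
# All lineages below are made-up placeholders, not real mock community data.

def truncate(lineage, depth):
    """Return the first `depth` ranks of a semicolon-delimited lineage.

    Returns None when the lineage is shallower than `depth` or is labeled
    "Unassigned" (assumed label), i.e. it is treated as unclassified here.
    """
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    if len(ranks) < depth or ranks[0] == "Unassigned":
        return None
    return ";".join(ranks[:depth])


def precision_recall_f(observed, expected, depth):
    """Compare the sets of observed and expected taxa at `depth` ranks."""
    obs = {truncate(t, depth) for t in observed} - {None}
    exp = {truncate(t, depth) for t in expected} - {None}
    true_pos = len(obs & exp)
    precision = true_pos / len(obs) if obs else 0.0
    recall = true_pos / len(exp) if exp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical expected taxonomy (reformatted to match the reference).
    expected = [
        "Bacteria;PhylumA;GenusA1",
        "Bacteria;PhylumA;GenusA2",
        "Bacteria;PhylumB;GenusB1",
    ]
    # Hypothetical classifier output at one confidence setting.
    observed = [
        "Bacteria;PhylumA;GenusA1",
        "Bacteria;PhylumA",
        "Bacteria;PhylumB;GenusB1",
        "Bacteria;PhylumB;GenusB2",
        "Unassigned",
    ]
    for depth in (2, 3):
        p, r, f = precision_recall_f(observed, expected, depth)
        print(f"depth={depth} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Running this once per confidence setting (for example 0.3, 0.5 and 0.7) and comparing the scores would show whether the lower threshold mostly recovers expected taxa or mostly adds false positives. Note that this sketch ignores abundances and only checks which taxa are present.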

Alternatively, if your database is reasonably complete (i.e., represents most of the diversity that you expect to find in nature) you could use RESCRIPt to test classification accuracy via simulation. See here for details:

Good luck!

Thank you @Nicholas_Bokulich. I will try and will let you know.

Edit: The mock community files do not work with QIIME 2 for me, since I don't have the FASTQ files.

Thanks again
