sklearn classifier, confidence and functional genes

Nicholas_Bokulich · February 5, 2021, 7:30am

Long story short: this has not been benchmarked, so getting a real answer to this will require some validation.

Fortunately, based on the details you have given, there are a couple of good options:

Here are some data from a nifH mock community that you could use to do a quick benchmark. A single mock community might not tell you a great deal, but it would give you a sense of whether 0.3 is a reasonable confidence setting.

Basically, you would:

download the mock community data (links to the data are in the dataset metadata file at that link)
process with your analysis pipeline
evaluate classification accuracy by comparing your observed versus expected taxonomy. Observed is the output of your pipeline. The expected taxonomy is listed in the "source" directory at the link above. However, that taxonomy just lists the taxonomic lineages expected by the source contributor, and these are probably not formatted to match the taxonomic names used in your reference database, so you might need to do a bit of reformatting locally to make sure that the expected taxonomies actually match your reference!
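
As a rough illustration of the last step, here is a minimal sketch in plain Python of one way to score observed versus expected taxonomy at a chosen rank. It is not part of QIIME 2 or of the mock community resource. The function names (`truncate`, `taxon_scores`), the semicolon-delimited lineage format, and the example lineages are all assumptions made for illustration, and the set-based precision/recall used here is just one simple scoring choice.

```python
def truncate(lineage, level):
    """Keep the first `level` ranks of a semicolon-delimited lineage."""
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    return ";".join(ranks[:level])


def taxon_scores(observed, expected, level):
    """Set-based precision, recall and F-measure at one taxonomic level.

    A lineage that is classified to fewer than `level` ranks keeps its
    shorter label, so it will not match a deeper expected lineage.
    """
    obs = {truncate(t, level) for t in observed}
    exp = {truncate(t, level) for t in expected}
    hits = obs & exp
    precision = len(hits) / len(obs) if obs else 0.0
    recall = len(hits) / len(exp) if exp else 0.0
    if not hits:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Placeholder lineages, NOT taken from the nifH mock community.
expected = [
    "k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria",
    "k__Bacteria;p__Cyanobacteria;c__Cyanophyceae",
]
observed = [
    "k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria",
    "k__Bacteria;p__Firmicutes;c__Clostridia",
]

for level in (1, 2, 3):
    p, r, f = taxon_scores(observed, expected, level)
    print(f"level {level}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Running the same comparison on the taxonomy produced at several confidence settings (for example 0.3, 0.5, and 0.7) would show how the setting trades precision against recall on this one mock community. The expected lineages need to be reformatted to match your reference database first, otherwise correct classifications will be scored as mismatches.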

Alternatively, if your database is reasonably complete (i.e., represents most of the diversity that you expect to find in nature) you could use RESCRIPt to test classification accuracy via simulation. See here for details:

Good luck!