I have created these two classifiers for species-level taxonomic classification of the microbiome, targeting the V3-V4 and V1-V3 primer regions respectively. However, when I evaluated the classifiers' performance, I got this:
I was wondering whether the moderate F-measure (a score below 73 at phylum level and below) is due to the large number of replicated IDs in the HROM sequence and taxonomy files.
I think something is wrong, because that F-measure is flat across taxonomic levels. There are more taxon labels at lower levels, so entropy increases at those levels and the F-score should decrease.
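To make that expectation concrete, here is a toy sketch in Python. The lineages are made up, and the scoring rule is a simplified assumption (exact match at a level counts as a true positive, a wrong label as a false positive, an underclassified label as a false negative); it is not necessarily what any particular evaluation plugin computes. `f_measure_by_level` is a hypothetical helper, not part of QIIME 2.

```python
def f_measure_by_level(expected, observed, max_level):
    """Return {level: F-measure} for semicolon-delimited lineages.

    Assumed, simplified scoring at each level:
      TP: observed lineage truncated to `level` equals the expected one
      FP: observed lineage reaches `level` but differs from the expected one
      FN: observed lineage is shallower than `level` (underclassified)
    """
    scores = {}
    for level in range(1, max_level + 1):
        tp = fp = fn = 0
        for seq_id, exp in expected.items():
            exp_ranks = [r.strip() for r in exp.split(";") if r.strip()]
            if len(exp_ranks) < level:
                continue  # the expected lineage has no label this deep
            obs = observed.get(seq_id, "")
            obs_ranks = [r.strip() for r in obs.split(";") if r.strip()]
            if len(obs_ranks) < level:
                fn += 1
            elif obs_ranks[:level] == exp_ranks[:level]:
                tp += 1
            else:
                fp += 1
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        total = precision + recall
        scores[level] = 2 * precision * recall / total if total else 0.0
    return scores


# Made-up lineages (phylum;class;genus) purely for illustration.
expected = {
    "s1": "Firmicutes;Bacilli;Lactobacillus",
    "s2": "Firmicutes;Bacilli;Streptococcus",
    "s3": "Proteobacteria;Gammaproteobacteria;Escherichia",
    "s4": "Proteobacteria;Gammaproteobacteria;Salmonella",
}
observed = {
    "s1": "Firmicutes;Bacilli;Lactobacillus",              # correct at every level
    "s2": "Firmicutes;Bacilli;Lactobacillus",              # wrong at the deepest level
    "s3": "Proteobacteria;Gammaproteobacteria",            # underclassified
    "s4": "Proteobacteria;Alphaproteobacteria;Rhizobium",  # wrong from level 2 down
}

scores = f_measure_by_level(expected, observed, max_level=3)
for level, score in scores.items():
    print(f"level {level}: F = {score:.3f}")
```

In this toy data the score drops at each deeper level, because errors and underclassifications accumulate as the number of possible labels grows. A curve that stays flat would mean no additional errors appear at deeper levels.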
Here's an example of that from the UNITE database:
The artifact provenance shows lca instead of uniq when dereplicating. Could this be flattening out those curves?
qiime rescript dereplicate --p-mode uniq ...
Also consider --p-mode super if the database lineages are strictly hierarchical, to prevent hybrid taxonomies. (In super mode it will return the most commonly assigned taxonomy at each rank, which is normally not an issue for properly curated taxonomies.)
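As a rough illustration of why the dereplication mode matters, here is a toy Python sketch of the two ideas: keeping every distinct taxonomy attached to an identical sequence ("uniq"-like) versus collapsing identical sequences to their last common ancestor ("lca"-like). This is my own simplified assumption of the behaviour, with made-up records and a hypothetical `dereplicate` helper; it is not RESCRIPt's implementation and it does not cover the other modes.

```python
from collections import defaultdict


def dereplicate(records, mode="uniq"):
    """Toy dereplication of (seq_id, sequence, lineage) records.

    mode="uniq": keep one record per distinct (sequence, lineage) pair.
    mode="lca":  keep one record per sequence, labelled with the ranks
                 shared by all lineages attached to that sequence.
    """
    groups = defaultdict(list)
    for seq_id, seq, lineage in records:
        groups[seq].append((seq_id, lineage))

    kept = []
    for seq, members in groups.items():
        if mode == "uniq":
            seen = set()
            for seq_id, lineage in members:
                if lineage not in seen:
                    seen.add(lineage)
                    kept.append((seq_id, seq, lineage))
        elif mode == "lca":
            rank_lists = [lineage.split(";") for _, lineage in members]
            common = []
            for ranks in zip(*rank_lists):
                if len(set(ranks)) != 1:
                    break  # lineages diverge at this rank
                common.append(ranks[0])
            kept.append((members[0][0], seq, ";".join(common)))
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return kept


# Made-up records: three identical sequences with two different genus labels.
records = [
    ("a1", "ACGT", "Firmicutes;Bacilli;Lactobacillus"),
    ("a2", "ACGT", "Firmicutes;Bacilli;Streptococcus"),
    ("a3", "ACGT", "Firmicutes;Bacilli;Lactobacillus"),
    ("b1", "GGCC", "Proteobacteria;Gammaproteobacteria;Escherichia"),
]

uniq_result = dereplicate(records, mode="uniq")
lca_result = dereplicate(records, mode="lca")
print(len(uniq_result), "records kept in uniq-like mode")
print(lca_result)
```

In the uniq-like mode both genus labels survive for the shared sequence, while the lca-like mode keeps a single record truncated to the shared ranks. So the mode changes which low-level labels exist in the reference that the classifier is trained and scored against.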
I'm also investigating classifier training now so I very much appreciate the discussion.
Like, it's a secret? (Some of my work is under NDA, it's okay.)
Sure, the validity of the method is separate from the quality of the code.
Robert Edgar has written up the problems with longer and shorter database sequences in the USEARCH manual. He argues that full coverage from global alignment is needed, meaning that this method will not work.