F-measure score in naive Bayes classification of self-downloaded HROM microbiome database

jlco · January 31, 2026, 12:00pm

Hey everyone.

I have a question about the naive Bayes classifier trained on the self-downloaded HROM microbiome dataset.

hrom-classifier-V1V3.qza (1.4 MB)
hrom-classifier-V3V4.qza (1.1 MB)

I created these two classifiers for species-level taxonomic classification of microbiome data, for the V1-V3 and V3-V4 primer regions respectively. However, when I evaluated classifier performance, I got this:

hrom-V1V3-eval.qzv (437.8 KB)
hrom-V3V4-eval.qzv (437.7 KB)

I was wondering whether the moderate F-measure score (phylum and below: <73) is due to the large number of replicate IDs in the HROM sequence and taxonomy files.

hrom-taxonomy.tsv (662.2 KB)

Thank you everyone! I appreciate any help.

colinbrislawn · January 31, 2026, 4:57pm

Hello Jason,

Thank you for sharing your HROM results. I was also working on some rescript evals and had similar questions.

HROM V1V3 eval:

HROM V3V4 eval:

I think something is wrong, because that F-measure is flat across taxonomy levels. There are more taxon labels at lower levels, so entropy increases there and the F-measure would normally decrease as you move from phylum toward species.

Here's an example of that from the UNITE database:

When there are fewer classification categories, it's easier to get them all right!
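To make the "more labels, more entropy" point concrete, here is a minimal Python sketch that counts distinct labels and computes the Shannon entropy of the label distribution at each rank. The lineages are made up for illustration (they are not taken from hrom-taxonomy.tsv), and the function name and the `;` separator are assumptions.

```python
import math
from collections import Counter

def rank_stats(lineages, sep=";"):
    """For each rank depth, return (number of distinct labels, Shannon entropy in bits)."""
    split = [[r.strip() for r in lin.split(sep)] for lin in lineages]
    depth = max(len(s) for s in split)
    stats = []
    for level in range(1, depth + 1):
        # Truncate every lineage to its first `level` ranks and count the labels.
        counts = Counter(tuple(s[:level]) for s in split)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        stats.append((len(counts), entropy))
    return stats

# Toy lineages, invented for illustration only.
toy = [
    "p__A; g__X; s__X one",
    "p__A; g__X; s__X two",
    "p__A; g__Y; s__Y one",
    "p__B; g__Z; s__Z one",
]
for level, (n_labels, h) in enumerate(rank_stats(toy), start=1):
    print(level, n_labels, round(h, 3))
```

On this toy set the number of labels grows from 2 to 4 and the entropy grows with it, which is the pattern I would expect the F-measure curve to reflect.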

Yeah, 73% is low for phylum!

I checked for replicates, and did find a lot:

sh-3.2$ grep -c "g__Rothia" hrom-taxonomy.tsv # example genus
196
sh-3.2$ grep -c "s__Rothia *" hrom-taxonomy.tsv # matching species
153
sh-3.2$ grep -c "s__Rothia mucilaginosa" hrom-taxonomy.tsv # genus + species
98
sh-3.2$ grep -c "s__Rothia mucilaginosa_B" hrom-taxonomy.tsv # genus + species + strain
24

sh-3.2$ grep -c "s__Rothia sp*" hrom-taxonomy.tsv  # genus + unknown species
52
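One caveat on the grep counts above: grep treats the pattern as a regular expression, so `sp*` means an `s` followed by zero or more `p`, and the last count may include any Rothia species whose name starts with `s`. For exact counts of repeated IDs and repeated lineages, a minimal Python sketch could look like this. It assumes a two-column, tab-separated file with a header row, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

def count_repeats(handle):
    """Count how often each feature ID and each taxonomy string occurs in a taxonomy TSV."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (assumed to be present)
    ids, taxa = Counter(), Counter()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        ids[row[0]] += 1
        taxa[row[1]] += 1
    return ids, taxa

# Made-up rows for illustration; with the real file, pass
# open("hrom-taxonomy.tsv", newline="") instead of this StringIO.
sample = io.StringIO(
    "Feature ID\tTaxon\n"
    "id1\tg__Rothia; s__Rothia mucilaginosa\n"
    "id2\tg__Rothia; s__Rothia mucilaginosa\n"
    "id2\tg__Rothia; s__Rothia mucilaginosa_B\n"
)
ids, taxa = count_repeats(sample)
print({k: n for k, n in ids.items() if n > 1})   # IDs listed more than once
print({k: n for k, n in taxa.items() if n > 1})  # lineages shared by several rows
```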

jlco · January 31, 2026, 5:02pm

So, would you suggest keeping only one HROM ID and dropping the others of the same kind? If so, how would you do it?

Personally, my initial thought is to keep the one with the longest sequence.
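As a rough sketch of that idea only (not a tested pipeline), assuming the sequences and their group labels are already loaded into dictionaries. The function name, IDs, and sequences are hypothetical, and whether to group by ID or by species is left open:

```python
def longest_per_group(seqs, group_of):
    """Keep, for each group label, the longest sequence.

    seqs: dict mapping sequence ID -> sequence string
    group_of: dict mapping sequence ID -> group label (e.g. a species lineage)
    """
    best = {}
    for seq_id, seq in seqs.items():
        group = group_of[seq_id]
        # Replace the current pick only when this sequence is strictly longer,
        # so ties keep the first sequence seen.
        if group not in best or len(seq) > len(seqs[best[group]]):
            best[group] = seq_id
    return {seq_id: seqs[seq_id] for seq_id in best.values()}

# Hypothetical IDs and sequences, for illustration only.
seqs = {"a1": "ACGT", "a2": "ACGTACGT", "b1": "GG"}
group_of = {"a1": "s__X one", "a2": "s__X one", "b1": "s__Y one"}
print(longest_per_group(seqs, group_of))  # {'a2': 'ACGTACGT', 'b1': 'GG'}
```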

colinbrislawn · February 2, 2026, 9:53pm

The artifact provenance shows lca instead of uniq when dereplicating. Could this be flattening out those curves?

qiime rescript dereplicate --p-mode uniq ...

Also consider --p-mode super if the database lineages are strictly hierarchical, to prevent hybrid taxonomies. (super returns the most commonly assigned taxonomy at each rank, which is normally not an issue for properly curated taxonomies.)
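To illustrate why lca could flatten the curves: as I understand that mode, identical sequences carrying different lineages are collapsed to their least common ancestor, so species-level labels can be lost. The Python sketch below is a conceptual illustration only, not RESCRIPt's implementation, and the lineages are made up.

```python
def lca(lineages, sep=";"):
    """Return the ranks shared, from the root down, by every lineage in the list."""
    split = [[r.strip() for r in lin.split(sep)] for lin in lineages]
    common = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        common.append(ranks[0])
    return "; ".join(common)

# Two made-up lineages attached to one identical sequence.
same_seq = [
    "g__Rothia; s__Rothia mucilaginosa",
    "g__Rothia; s__Rothia mucilaginosa_B",
]
print(lca(same_seq))  # g__Rothia
```

Under this reading, uniq would keep both records with their own species labels, while lca would keep one record labeled only to genus.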

I'm also investigating classifier training now so I very much appreciate the discussion.

jlco · February 3, 2026, 5:08am

I tried both, but the results got worse. Then someone suggested that I keep only the longest sequence for each ID, which gave me an F1 of 0.9.

colinbrislawn · February 3, 2026, 5:15am

Can you share the command you ran to do this?

EDIT: or upload a QZA file of the result? Maybe you did this without using Qiime2?

jlco · February 3, 2026, 8:33am

MACHINE_LEARNING_PROJECT_1.zip (5.3 MB)
Please check the classifiers, results and bash script inside the folder. Note that I used QIIME2 rescript.

colinbrislawn · February 3, 2026, 3:30pm

I can't see the part that does this in the two .sh files. Am I missing something? Can you point me to a line?

jlco · February 3, 2026, 3:57pm

My scripts do not include this, as the selection process was done privately.

Anyway, the question is not about the code for selecting the longest sequence of each species. It is whether the approach is acceptable or not.

colinbrislawn · February 3, 2026, 4:14pm

Like, it's a secret? (Some of my work is under NDA, it's okay.)

Sure, the validity of the method is separate from the quality of the code.

Robert Edgar has written up the problems with longer and shorter database sequences in the USEARCH manual. He argues that full coverage from global alignment is needed, meaning that this method will not work.