This appears to be something new I've noticed with QIIME 2 2024.5.
When I build a classifier from HOMD (https://homd.org/), I have noticed that the order of the
rep seqs matters for the resulting classification.
Basically, I build the classifier (build-HOMD-v4.sh, attached), then apply it to the rep seqs, and
they are mostly classified only as k__Bacteria. I know this is not what I used to get with older
versions of QIIME 2.
But if I export the FASTA sequences, sort them by sequence ID, import them again, and
re-classify, I actually get the results I am expecting (and which I used to get with older versions).
I've attached everything needed to reproduce. Basically, run sh eg.sh, which downloads the data
needed to build the classifier and builds it. It then runs the classification on the original rep seqs
and again after sorting (class-reorder-class.sh), and dumps the first ten calls for each classification.
I've noticed that this also happens with MOMD (https://momd.org/), but not with GreenGenes2
or Silva. Maybe because HOMD and MOMD are so much smaller?
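For anyone following along without the attachments, the export, sort, and re-import step described above could look roughly like the sketch below. It is not the attached class-reorder-class.sh. The function name sort_fasta_by_id, the exported-rep-seqs directory, and sorted.fasta are hypothetical names chosen for illustration, and the sketch assumes FASTA headers contain no tab characters. The two .qza names are the ones used in this thread.

```shell
# Sort a FASTA file by header line (sequence ID).
# Assumption: headers contain no tab characters.
# Note: wrapped sequences are joined onto a single line per record.
sort_fasta_by_id() {
  awk '
    /^>/ { if (id != "") print id "\t" seq; id = $0; seq = ""; next }
         { seq = seq $0 }
    END  { if (id != "") print id "\t" seq }
  ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 | tr '\t' '\n'
}

# Only touch QIIME 2 if it is installed and the input artifact is present.
if command -v qiime >/dev/null 2>&1 && [ -f orig-rep-seqs.qza ]; then
  # Exporting FeatureData[Sequence] writes dna-sequences.fasta
  qiime tools export \
    --input-path orig-rep-seqs.qza \
    --output-path exported-rep-seqs   # hypothetical directory name

  sort_fasta_by_id exported-rep-seqs/dna-sequences.fasta > sorted.fasta

  qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path sorted.fasta \
    --output-path rep-seqs-sorted.qza
fi
```

Sorting with LC_ALL=C keeps the order byte-wise and therefore the same on every machine, which matters when the whole point is to compare two orderings.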
Thank you for including your full pipeline. I took a look and summarized the differences here:
The same database, homd-15.23-515-806-nb-q2-2024.5.qza, is used throughout.
The same reads are used in orig-rep-seqs.qza and rep-seqs-sorted.qza; the second one is just sorted.
(Am I understanding this correctly? Please correct any mistakes!)
So everything is the same. And yet!
The predicted taxonomy of each sequence should be independent and stable, with other sequences and their order making no difference. So this looks like a bug!
One of the Mods or Staff will try to reproduce and report back here!
By default, the feature-classifier classify-sklearn command will auto-detect the orientation of the reads relative to the database.
  --p-read-orientation TEXT Choices('same', 'reverse-complement', 'auto')
                          Direction of reads with respect to reference
                          sequences. same will cause reads to be classified
                          unchanged; reverse-complement will cause reads to be
                          reversed and complemented prior to classification.
                          "auto" will autodetect orientation based on the
                          confidence estimates for the first 100 reads.
                          [default: 'auto']
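Here, that auto-detection fails on the unsorted reads. Because 'auto' bases its decision on the first 100 reads only, which reads happen to come first changes the outcome, and that is why sorting made a difference.
Passing --p-read-orientation 'same' to classify-sklearn produces the results you expected!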
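As a sketch, the rerun with the orientation fixed might look like this. The classifier and reads file names are the ones from this thread; the output name taxonomy-same-orientation.qza is a hypothetical choice, so adjust it to fit your pipeline.

```shell
# Classifier and reads as named in this thread.
CLASSIFIER=homd-15.23-515-806-nb-q2-2024.5.qza
READS=orig-rep-seqs.qza
OUT=taxonomy-same-orientation.qza   # hypothetical output name

if command -v qiime >/dev/null 2>&1; then
  # Skip orientation auto-detection and classify the reads as given.
  qiime feature-classifier classify-sklearn \
    --i-classifier "$CLASSIFIER" \
    --i-reads "$READS" \
    --p-read-orientation same \
    --o-classification "$OUT"
else
  echo "qiime not found on PATH; activate the QIIME 2 environment first" >&2
fi
```

This assumes the reads really are in the same orientation as the reference sequences. If they are not, 'reverse-complement' would be the value to try instead.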