For QIIME 2 2024.5, does the order of the sequences matter for feature-classifier classify-sklearn?

This appears to be something new I've noticed with QIIME 2 2024.5.

When I build a classifier from HOMD (https://homd.org/), I have noticed that the order of the
rep seqs matters for the resulting classification.

Basically, I build the classifier (build-HOMD-v4.sh, attached) and then apply it to the rep seqs.
They are mostly classified only as k__Bacteria. I know this is not what I used to get with older
versions of QIIME 2.

But if I export the FASTA sequences, sort them by sequence ID, import them again, and
re-classify, I get the results I am expecting (and which I used to get with older versions).
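
For readers without the attachment, the sorting step can be sketched in plain Python. This is only an illustration of sorting FASTA records by ID, not the contents of class-reorder-class.sh; the function names `read_fasta` and `sort_fasta` are made up for this example, and wrapped sequence lines are joined into a single line on output.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sort_fasta(text):
    """Return FASTA text with records ordered by sequence ID."""
    records = read_fasta(text)
    # The ID is taken to be the first whitespace-delimited token of the header.
    records.sort(key=lambda rec: rec[0].split(None, 1)[:1])
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Hypothetical two-record input, deliberately out of order.
example = ">b2\nACGT\nTTGA\n>a1\nGGCC\n"
print(sort_fasta(example))
```

The sorted text would then be imported again as a FeatureData[Sequence] artifact before re-classifying.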

REPORT.zip (57.9 KB)

I've attached everything needed to reproduce. Running sh eg.sh downloads the data needed to
build the classifier and builds it. It then runs the classification on the original rep seqs and
again after sorting (class-reorder-class.sh), and finally dumps the first ten calls for each classification.

I've noticed that this also happens with MOMD (https://momd.org/), but not with GreenGenes2
or Silva. Maybe because HOMD and MOMD are so much smaller?

Any thoughts?


Hello @roachjm-unc,

Welcome to the forums! :qiime2:

Thank you for including your full pipeline. I took a look and summarized the differences here:

  • the same database homd-15.23-515-806-nb-q2-2024.5.qza is used throughout
  • the same reads are used in orig-rep-seqs.qza and rep-seqs-sorted.qza... just the second one is sorted

(Am I understanding this correctly? Please correct any mistakes!)

So everything is the same. And yet!

The predicted taxonomy of each sequence should be independent and stable, with other sequences and their order making no difference. So this looks like a bug!

One of the Mods or Staff will try to reproduce and report back here!

Thank you for bringing this to our attention.

I've reproduced the issue:

$ head -n 5 rep-seqs-calls-homd.txt
002ffec53cd196b4bc3249d63d0eab78        k__Bacteria     0.9999999964891092
005d517ab8f6efa378a09d105f8945ef        k__Bacteria     0.9999999898922893
00a26b2be8bc19e5315b238405f92e27        k__Bacteria     0.9999999900657376
00a31fd31eecee0c36145ffef0e1f9e3        k__Bacteria     0.9999999995496995
00a96fbd0ac34bfd245f9c24f8737f7d        k__Bacteria     0.9999999974907293

$ head -n 5 rep-seqs-sorted-calls-homd.txt
002ffec53cd196b4bc3249d63d0eab78        k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales    0.999993050730017
005d517ab8f6efa378a09d105f8945ef        k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Weeksellaceae;g__Chryseobacterium;s__sp._HMT_319                                                                 0.9973339704566835
00a26b2be8bc19e5315b238405f92e27        k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas                                                                             0.9999952165692579
00a31fd31eecee0c36145ffef0e1f9e3        k__Bacteria     0.9999999887996882
00a96fbd0ac34bfd245f9c24f8737f7d        k__Bacteria;p__Firmicutes;c__Clostridia;o__Eubacteriales;f__Lachnospiraceae   0.9999634486318909
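
To go beyond eyeballing the first five rows, the two call files can be compared by feature ID. Below is a rough sketch that assumes each row is `ID, taxon, confidence` separated by whitespace with no header line, as in the output above; `parse_calls` and `changed_calls` are names invented for this example.

```python
def parse_calls(text):
    """Map feature ID -> (taxon, confidence) from whitespace-delimited rows."""
    calls = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # First token is the ID, last token is the confidence,
        # everything in between is the taxonomy string.
        feature_id, rest = line.split(None, 1)
        taxon, confidence = rest.rsplit(None, 1)
        calls[feature_id] = (taxon.strip(), float(confidence))
    return calls


def changed_calls(before, after):
    """Return the IDs present in both tables whose taxon differs."""
    shared = sorted(set(before) & set(after))
    return [fid for fid in shared if before[fid][0] != after[fid][0]]


# Two rows copied from each listing above.
unsorted_calls = parse_calls(
    "00a26b2be8bc19e5315b238405f92e27 k__Bacteria 0.9999999900657376\n"
    "00a31fd31eecee0c36145ffef0e1f9e3 k__Bacteria 0.9999999995496995\n"
)
sorted_calls = parse_calls(
    "00a26b2be8bc19e5315b238405f92e27 k__Bacteria;p__Proteobacteria;"
    "c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;"
    "g__Pseudomonas 0.9999952165692579\n"
    "00a31fd31eecee0c36145ffef0e1f9e3 k__Bacteria 0.9999999887996882\n"
)
print(changed_calls(unsorted_calls, sorted_calls))
```

On the full files, one would read each file's text and pass it to `parse_calls` in the same way.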

@roachjm-unc,

I found the issue with help from @Nicholas_Bokulich

By default, the feature-classifier classify-sklearn command auto-detects the orientation of the reads relative to the database.

--p-read-orientation TEXT Choices('same', 'reverse-complement', 'auto')
  Direction of reads with respect to reference
  sequences. same will cause reads to be classified
  unchanged; reverse-complement will cause reads to be
  reversed and complemented prior to classification.
  "auto" will autodetect orientation based on the
  confidence estimates for the first 100 reads.
    [default: 'auto']

Here, that auto-detection fails on the unsorted reads. :crying_cat_face:
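
Why would read order matter at all? Going only by the help text above, the decision rests on the first 100 reads, so whichever reads happen to come first decide the orientation for everyone. Here is a toy model of that idea. It is not the plugin's actual code: the confidence numbers are invented, and `detect_orientation` is a hypothetical stand-in that simply compares summed confidences over the first `n` reads.

```python
def detect_orientation(confidence_pairs, n=100):
    """Pick an orientation from the first n reads only.

    confidence_pairs is a list of (same_conf, revcomp_conf) tuples
    in input order. Toy rule: the larger summed confidence wins.
    """
    head = confidence_pairs[:n]
    same_total = sum(same for same, _ in head)
    rc_total = sum(rc for _, rc in head)
    return "same" if same_total >= rc_total else "reverse-complement"


# Invented per-read confidences: (as-is, reverse-complemented).
pairs_by_id = {
    "r1": (0.9, 0.3),
    "r2": (0.8, 0.4),
    "r3": (0.5, 0.99),
    "r4": (0.6, 0.98),
}

input_order = ["r4", "r3", "r1", "r2"]
sorted_order = sorted(pairs_by_id)

# n=2 keeps the toy small; the real default window is 100 reads.
as_given = detect_orientation([pairs_by_id[i] for i in input_order], n=2)
after_sort = detect_orientation([pairs_by_id[i] for i in sorted_order], n=2)
print(as_given, after_sort)
```

The same four reads give a different answer depending on which two come first, which is the kind of order dependence seen in this thread.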

Passing --p-read-orientation 'same' to classify-sklearn produces the results you expected!
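
If you drive QIIME 2 from a script, the workaround can be pinned so it is never left on 'auto' by accident. A minimal sketch follows: the option names are the standard classify-sklearn CLI options, the input file names are the ones from this thread, and the output name `orig-taxonomy.qza` and the helper `classify_command` are made up for this example.

```python
import shlex

ORIENTATIONS = ("same", "reverse-complement", "auto")


def classify_command(classifier, reads, output, orientation="same"):
    """Build the argv list for classify-sklearn with an explicit orientation."""
    if orientation not in ORIENTATIONS:
        raise ValueError(f"unknown read orientation: {orientation!r}")
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", reads,
        "--p-read-orientation", orientation,
        "--o-classification", output,
    ]


cmd = classify_command(
    "homd-15.23-515-806-nb-q2-2024.5.qza",
    "orig-rep-seqs.qza",
    "orig-taxonomy.qza",  # hypothetical output name
)
print(shlex.join(cmd))
# To actually run it inside an activated QIIME 2 environment:
# import subprocess; subprocess.run(cmd, check=True)
```

Only use 'same' when you know the reads are in the same orientation as the reference sequences, as they are here.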

Full writeup here: q2-forums/31229 at main · colinbrislawn/q2-forums · GitHub

Thank you for providing this extremely reproducible data set. This was super helpful to me!


@colinbrislawn and @Nicholas_Bokulich,

Thank you for finding the solution so quickly. It would have taken me a while to figure that out.

This is really helpful. Thanks again.
