qiime2 feature-classifier classify-sklearn reads' orientation detection problem

Dear QIIME2 developer team!
Thank you for such a valuable tool as QIIME2. I have a question about taxonomic classification. In one of my projects I got a strange taxonomy: for example, all the reads in an oral microbiome were classified as Archaea (using a classifier trained on the Silva database, version 138). This happens when the read orientation is set to "auto" in the feature-classifier. When I set the orientation to "same", I get good results.

I read that with the "auto" option the orientation is detected from the first 100 reads. I classified the first 100 reads with both the "same" and "reverse" options. The result was much better with the "same" option (higher confidence values). Then I looked at the code on the GitHub project page and saw that the orientation-detection module sets --p-confidence to 0.0, while the final classification uses 0.7 by default. When I set --p-confidence to 0, the results became much better with the reverse orientation, but with strange taxa (mostly g__Deep_Sea_Euryarchaeotic_Group (DSEG); s__uncultured_archaeon). I tried to look up these taxa in the Silva database, but there are a lot of them, and at first glance I did not see anything strange.

I would be very grateful for any ideas about why this effect occurs. Thank you!

Hi @Natalia_Klimenko , welcome to the forum!

It looks like your reads are in the "same" orientation. Auto-orientation usually works well, but in your case it appears to fail because of a bad sequence in the database.
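
As a rough illustration of how a single odd reference could tip the choice, here is a toy sketch. It assumes the orientation is picked by comparing the summed confidence of a sample of reads classified both ways, which matches the behaviour you describe but is not the plugin's actual code. The function name, the orientation labels, and all confidence values are made up for the example.

```python
def pick_orientation(same_conf, revcomp_conf):
    """Toy rule: choose the orientation with the higher total confidence.

    This is an assumed decision rule for illustration only, not the
    implementation used by q2-feature-classifier.
    """
    if sum(same_conf) >= sum(revcomp_conf):
        return "same"
    return "reverse-complement"


# Hypothetical confidences for five sampled reads classified as-is.
same_conf = [0.62, 0.58, 0.65, 0.60, 0.61]

# Hypothetical confidences for the same reads after reverse-complementing,
# where every read is drawn to one unusual reference sequence.
revcomp_conf = [0.71, 0.69, 0.73, 0.70, 0.72]

# The spurious hits win the comparison, so the wrong orientation is chosen.
print(pick_orientation(same_conf, revcomp_conf))
```

If something like this is happening, setting the orientation to "same" explicitly, as you did, sidesteps the comparison entirely.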

The only other time we have seen a similar issue is when reads are in mixed orientations, so that is another possibility, but it looks like instead that "deep sea" group is causing issues.

That sequence is probably either a misannotation in the database or unusually short. See these topics for relevant discussion:

Did you train your own classifier? If so, you might want to check out the RESCRIPt plugin for ways to improve the database before training the classifier (this is what the QIIME 2 pre-trained classifiers use to format the SILVA database and remove low-quality sequences prior to training):

Dear Nicholas Bokulich, thank you for the response!
To train the classifier, we used the database version from the QIIME2 website that was already preprocessed with RESCRIPt. Thank you for the ideas! We will try to search our database for such bad sequences among the "deep sea" entries.
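
A minimal sketch of the kind of search we have in mind, assuming the reference taxonomy and sequences have been exported and loaded into plain dictionaries keyed by sequence ID. The function name, the 0.8 length cutoff, and the example records are hypothetical; only the taxon label comes from our results.

```python
from statistics import median


def flag_suspects(taxonomy, sequences, keyword, min_fraction=0.8):
    """Return IDs whose taxonomy contains `keyword` and whose sequence is
    shorter than `min_fraction` of the median reference length."""
    cutoff = min_fraction * median(len(seq) for seq in sequences.values())
    return sorted(
        seq_id
        for seq_id, taxon in taxonomy.items()
        if keyword in taxon and len(sequences[seq_id]) < cutoff
    )


# Hypothetical records standing in for the exported reference database.
taxonomy = {
    "ref1": "d__Bacteria; g__Streptococcus",
    "ref2": "d__Archaea; g__Deep_Sea_Euryarchaeotic_Group (DSEG); s__uncultured_archaeon",
    "ref3": "d__Archaea; g__Deep_Sea_Euryarchaeotic_Group (DSEG); s__uncultured_archaeon",
}
sequences = {
    "ref1": "ACGT" * 100,  # 400 nt
    "ref2": "ACGT" * 100,  # 400 nt
    "ref3": "ACGT" * 40,   # 160 nt, unusually short
}

suspects = flag_suspects(taxonomy, sequences, "Deep_Sea_Euryarchaeotic_Group")
print(suspects)
```

This only catches unusually short entries; a misannotated sequence of normal length would need a different check.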
