Hi there, I am reporting this as a potential bug but hoping there is a workaround in the code that I'm simply unaware of.
When using QIIME 2's qiime feature-classifier fit-classifier-naive-bayes and then qiime feature-classifier classify-sklearn to classify taxonomy, the classifier automatically returns a Kingdom assignment even if the sequence is not assigned at the Kingdom level. I have tested this on both 16S using SILVA and ITS using the UNITE databases.
This is potentially a huge problem, as there is no way to determine which sequences were not actually assigned at the Kingdom level. I still run into many papers where taxa without a Phylum assignment are not filtered out (perhaps such a filter could be added downstream to catch this error). The result would be massively inflated taxa estimates and erroneous taxonomic assignments, such as microeukaryotes picked up by ITS2 sequencing being assigned to the fungal kingdom by QIIME.
Thoughts? Is there a flag on classify-sklearn that avoids this issue?
Hi @evs_017, welcome!
Can you clarify this statement?
Can you provide some explicit examples?
I agree, this is disheartening.
If you search the forum, there are many threads and discussions about making sure that users have sufficient "out-group" or "decoy" sequences / taxa within the reference database. In fact, this is why we often recommend that the full "eukaryote" UNITE database be used. When using the "fungi only" reference database, it is common for non-fungal taxa to be classified as Fungi, simply because there is nothing other than fungi to compare your sequences to. In other words, the sequence would have to be a very bad match to return as "Unclassified". This can happen with any reference database. For example, an SSU reference database without eukaryotes can erroneously classify eukaryotes as bacteria or archaea.
Again, please provide some examples for your earlier questions. The latter issue, though, is simply a matter of understanding how a given reference database is curated and prepared. For example, no amount of parameter tuning will help when there are not sufficient out-group taxa, or even good reference taxa, in your reference database.
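In the meantime, if the goal is just to see which features stop at the Kingdom level, here is a minimal Python sketch of one way to flag them from exported taxonomy strings. It is not part of QIIME 2. The function names, the rank prefixes (k__ / d__ / p__ ...), and the placeholder labels it treats as "no real assignment" are all assumptions about how your taxonomy strings are formatted, so adjust them to match your reference database.

```python
# Hypothetical helper, not a QIIME 2 API.
# Assumes taxonomy strings look like "k__Fungi;p__Ascomycota;..." (UNITE-style)
# or "d__Bacteria; p__Proteobacteria; ..." (SILVA-style).

# Labels treated as "no real name at this rank" (assumption; extend as needed).
PLACEHOLDERS = {"unidentified", "unclassified"}


def deepest_rank(taxon: str) -> str:
    """Return the prefix of the deepest rank with a real name, or "" if none."""
    deepest = ""
    for field in taxon.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if not sep:
            continue  # e.g. "Unassigned" carries no rank prefix
        if not name or name.lower() in PLACEHOLDERS:
            break  # stop at the first empty or placeholder rank
        deepest = prefix
    return deepest


def lacking_phylum(taxonomy: dict[str, str]) -> list[str]:
    """Return feature IDs resolved no deeper than kingdom/domain."""
    return [
        feature_id
        for feature_id, taxon in taxonomy.items()
        if deepest_rank(taxon) in {"", "k", "d"}
    ]


if __name__ == "__main__":
    # Made-up feature IDs and assignments, for illustration only.
    example = {
        "f1": "k__Fungi;p__Ascomycota;c__Sordariomycetes",
        "f2": "k__Fungi",
        "f3": "k__Fungi;p__unidentified;c__unidentified",
        "f4": "d__Bacteria; p__Proteobacteria",
        "f5": "Unassigned",
    }
    print(lacking_phylum(example))  # ['f2', 'f3', 'f5']
```

Features flagged this way are good candidates for a closer look (for example, a BLAST search) before deciding whether to drop them. The QIIME 2 filtering tutorial describes a similar approach with qiime taxa filter-table and --p-include p__, which keeps only features that carry a phylum-level annotation.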