Dear QIIME 2 help desk,
I ran into a strange problem using a UNITE ITS-trained classifier in QIIME 2: some sequences are classified at a much lower resolution when they are submitted together with other sequences than when they are submitted alone!
I picked two sequences to make a very simple example:
When individually submitted, they were both classified as: k__Fungi;p__Basidiomycota;c__Tremellomycetes;o__Tremellales;f__Tremellaceae;g__Cryptococcus;s__Cryptococcus_neoformans
However, sequence "123830878d97229ae38d4d57ca68335c" was classified as "k__Fungi" when these two sequences were submitted to the classifier together!
I would greatly appreciate it if you could run the UNITE classifier on these two sequences and see whether the same observation can be reproduced. Thank you so much!
I am using qiime2-2019.4. Here is what I did:
I fit the classifier following the exact commands in the tutorial (Fungal ITS analysis tutorial) and generated the classifier file: "unite-ver7-99-classifier-01.12.2017.qza"
The classification command I used was:
qiime feature-classifier classify-sklearn \
--i-classifier unite-ver7-99-classifier-01.12.2017.qza \
--i-reads seq.qza \
--p-confidence 0.7 \
--o-classification seq_tax.qza
Hi @chaibenl,
I believe the issue is that your input sequences are in mixed orientations (i.e., one is in the forward orientation relative to the reference sequences, and the other is in the reverse orientation).
classify-sklearn cannot currently handle mixed-orientation sequences; instead, it tries to guess the orientation of the sequences based on the first 100 or so sequences. That is why you get the correct classification when you classify one sequence alone, but a different answer when you classify the two queries together. The classifications produced by this method should remain constant under normal circumstances.
To fix, you have two options:
Use the classify-consensus-vsearch classifier instead.
Or put all your sequences in the same orientation. Unfortunately, QIIME 2 does not have an official method for this right now, but if you can find a way to re-orient the reads outside of QIIME 2, you can re-import them and classification should run smoothly.
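For the second option, here is a minimal Python sketch of one way to re-orient reads outside of QIIME 2. It is only an illustration under stated assumptions, not a QIIME 2 feature and not how classify-sklearn works internally: the function names (reverse_complement, kmers, orient) are hypothetical, and the heuristic simply keeps whichever strand shares more k-mers with a reference sequence that you supply in the known forward orientation.

```python
# Hypothetical helpers: a simple k-mer overlap heuristic for re-orienting reads.
# Assumes you have at least one reference sequence in the desired orientation.

# Translation table for complementing DNA; N is left as N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-mers in seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(query, reference_kmers, k=8):
    """Return query in whichever orientation shares more k-mers with the reference.

    reference_kmers should be built with the same k, e.g. kmers(reference, k).
    Ties (including no shared k-mers at all) leave the query unchanged.
    """
    flipped = reverse_complement(query)
    forward_hits = len(kmers(query, k) & reference_kmers)
    reverse_hits = len(kmers(flipped, k) & reference_kmers)
    return flipped if reverse_hits > forward_hits else query
```

You would build reference_kmers once from your reference sequence(s), pass every query through orient, write the results back to a FASTA file, and re-import that file into QIIME 2. Reads that share no k-mers with the reference in either orientation are returned unchanged, so it is worth checking those by hand.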
I do see real ITS sequences assigned to “k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified”. Was that because the classifier training set contains sequences labeled as such?
Should only reference sequences with complete annotated lineages (Kingdom to species) be retained as the training set?
Yes, if sequences with that annotation were used to train the classifier, then query sequences can be classified as "unidentified" species.
Whether to retain only reference sequences with complete lineages is totally a matter of personal taste (though it will impact accuracy). Just make sure you clearly document any steps you took to filter the database in any published results.
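If you do decide to filter the reference database, the check could look roughly like the Python sketch below. The names here (has_complete_lineage, filter_taxonomy) are hypothetical, and the sketch assumes the taxonomy is held as a dictionary mapping reference IDs to lineage strings in the seven-rank "k__...;s__..." style shown above. It treats a rank as missing only when its name is empty or exactly "unidentified"; you may want a stricter or looser rule.

```python
# Hypothetical filter: keep only reference entries annotated from kingdom to species.
RANK_PREFIXES = ("k__", "p__", "c__", "o__", "f__", "g__", "s__")


def has_complete_lineage(lineage):
    """Return True if all seven ranks are present and none is 'unidentified'."""
    ranks = [rank.strip() for rank in lineage.split(";")]
    if len(ranks) != len(RANK_PREFIXES):
        return False
    for prefix, rank in zip(RANK_PREFIXES, ranks):
        if not rank.startswith(prefix):
            return False
        name = rank[len(prefix):]
        if not name or name == "unidentified":
            return False
    return True


def filter_taxonomy(taxonomy):
    """Return a new dict containing only entries with complete lineages."""
    return {
        ref_id: lineage
        for ref_id, lineage in taxonomy.items()
        if has_complete_lineage(lineage)
    }
```

The same set of retained IDs would then need to be applied to the reference sequences before retraining the classifier, so that the sequences and taxonomy stay in sync.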