Fungal ITS UNITE Naive Bayesian Classifier problem

chaibenl · January 12, 2020, 2:55am

Dear QIIME 2 help desk,
I ran into a strange problem using a UNITE ITS-trained classifier in QIIME 2: some sequences are classified at much lower resolution when they are submitted together with other sequences than when they are submitted alone!

I picked two sequences to make a very simple example:

123830878d97229ae38d4d57ca68335c
ATGATTACATTCATTACATTTAGAAGTTTGTGTAAAACGTGCCGAAGCACATAAACAGTTCACAGGTGTAGATGGGTAGATAAATGGACCAAAGTCCAATATTCTCTACTGATCCTTCCGCAG
10ad94d905b178072ca910a1bb446c1d
TAGAGAATATTGGACTTTGGTCCATTTATCTACCCATCTACACCTGTGAACTGTTTATGTGCTTCGGCACGTTTTACACAAACTTCTAAATGTAATGAATGTAATCATATTATAACAATAATA

When individually submitted, they were both classified as: k__Fungi;p__Basidiomycota;c__Tremellomycetes;o__Tremellales;f__Tremellaceae;g__Cryptococcus;s__Cryptococcus_neoformans

However, sequence "123830878d97229ae38d4d57ca68335c" was classified as "k__Fungi" when these two sequences were submitted to the classifier together!

I would greatly appreciate it if you could run the UNITE classifier on these two sequences and see whether the same observation can be reproduced. Thank you so much!

I am using qiime2-2019.4. Here is what I did:

I downloaded a 2017 UNITE reference set (wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip).
I fit the classifier following the exact commands in the tutorial (Fungal ITS analysis tutorial) and generated the classifier file: "unite-ver7-99-classifier-01.12.2017.qza"
The classification command I used was:
qiime feature-classifier classify-sklearn \
--i-classifier unite-ver7-99-classifier-01.12.2017.qza \
--i-reads seq.qza \
--p-confidence 0.7 \
--o-classification seq_tax.qza

Nicholas_Bokulich · January 13, 2020, 2:44am

Hi @chaibenl,
I believe the issue is that your input sequences are in mixed orientations (i.e., one is in the forward orientation relative to the reference sequences, and the other is in the reverse orientation).

classify-sklearn cannot currently handle mixed-orientation sequences; instead, it tries to guess the orientation of the sequences based on the first 100 or so sequences. That is why you get the correct classification when you classify one sequence alone, but a different answer when classifying the two queries together. The classifications produced by this method should remain constant under normal circumstances.
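To illustrate the orientation point, here is a minimal shell sketch (not part of QIIME 2; the `revcomp` helper and the variable names are made up for this example). It checks whether the first 20 bases of the second query occur in the first query as given, or in its reverse complement. It assumes `rev`, `tr`, and `cut` are available on your system:

```shell
#!/bin/sh
# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
# (Hypothetical helper for illustration only.)
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# The two query sequences from the original post.
seq1="ATGATTACATTCATTACATTTAGAAGTTTGTGTAAAACGTGCCGAAGCACATAAACAGTTCACAGGTGTAGATGGGTAGATAAATGGACCAAAGTCCAATATTCTCTACTGATCCTTCCGCAG"
seq2="TAGAGAATATTGGACTTTGGTCCATTTATCTACCCATCTACACCTGTGAACTGTTTATGTGCTTCGGCACGTTTTACACAAACTTCTAAATGTAATGAATGTAATCATATTATAACAATAATA"

# Use the first 20 bases of the second query as a probe.
probe=$(printf '%s' "$seq2" | cut -c1-20)

# Look for the probe in seq1 as given.
case "$seq1" in
    *"$probe"*) echo "probe found in seq1 as given" ;;
    *)          echo "probe not found in seq1 as given" ;;
esac

# Look for the probe in the reverse complement of seq1.
case "$(revcomp "$seq1")" in
    *"$probe"*) echo "probe found in reverse complement of seq1" ;;
    *)          echo "probe not found in reverse complement of seq1" ;;
esac
```

For the two sequences posted above, the probe should turn up only in the reverse complement of the first read, which is what you would expect if the two reads are on opposite strands. Note that this does not tell you which of the two matches the orientation of the reference sequences.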

To fix, do one of the following:

Use the classify-consensus-vsearch classifier instead.
Put all your sequences in the same orientation. Unfortunately, QIIME 2 does not have an official method for this right now, but if you can find a way to re-orient the reads outside of QIIME 2, you can re-import them and classification should run smoothly.
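For the first option, a rough sketch of what the call could look like is below. The function name `classify_both_strands` and all file names are placeholders I made up; unlike classify-sklearn, this method takes the reference sequences and reference taxonomy as artifacts rather than a trained classifier. Please check `qiime feature-classifier classify-consensus-vsearch --help` in your own release before relying on the exact options:

```shell
#!/bin/sh
# Hypothetical wrapper around the vsearch-based consensus classifier.
# Arguments (all file names are placeholders):
#   $1  query reads          (FeatureData[Sequence], e.g. seq.qza)
#   $2  reference reads      (FeatureData[Sequence])
#   $3  reference taxonomy   (FeatureData[Taxonomy])
#   $4  output classification artifact
classify_both_strands() {
    # --p-strand both asks vsearch to search both strands of each query,
    # so forward and reverse-complemented reads can be mixed in one input.
    qiime feature-classifier classify-consensus-vsearch \
        --i-query "$1" \
        --i-reference-reads "$2" \
        --i-reference-taxonomy "$3" \
        --p-strand both \
        --o-classification "$4"
}
```

It would be called along the lines of `classify_both_strands seq.qza unite-ref-seqs.qza unite-ref-taxonomy.qza seq_tax_vsearch.qza`, where the two reference artifacts are the ones you imported when training the classifier.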

Good luck!

Nicholas_Bokulich · January 13, 2020, 8:58pm

2 posts were split to a new topic: Unidentified sequences in the UNITE database — why can classify-sklearn classify these?

chaibenl · January 13, 2020, 10:39pm

Hi, Nicholas,

Great answers!

I do see real ITS sequences assigned to "k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified". Is that because the classifier training set contains sequences labeled as such?

Should only reference sequences with completely annotated lineages (kingdom to species) be retained in the training set?

Thank you.

Nicholas_Bokulich · January 13, 2020, 11:33pm

Yes, if sequences with that annotation are in the classifier, then query sequences can be classified as "unidentified" species.

Whether to keep only fully annotated reference sequences is totally a matter of personal taste (though it will impact accuracy). Just make sure you clearly document any steps you took to filter the database in any published results.
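If you do decide to filter, one possible approach (a sketch only, not an official QIIME 2 or UNITE procedure) is to drop every reference entry whose lineage contains "unidentified" before importing and training. The function names are hypothetical, and the sketch assumes a two-column, tab-separated taxonomy file (ID, then lineage) and a FASTA file whose header IDs match the first column:

```shell
#!/bin/sh
# Keep only taxonomy rows whose lineage has no "unidentified" rank.
# Usage: filter_taxonomy in_taxonomy.tsv out_taxonomy.tsv
filter_taxonomy() {
    awk -F '\t' '$2 !~ /unidentified/' "$1" > "$2"
}

# Keep only FASTA records whose ID appears in the (filtered) taxonomy file.
# Usage: filter_fasta taxonomy.tsv in.fasta out.fasta
filter_fasta() {
    awk -F '\t' '
        NR == FNR { keep[$1] = 1; next }   # first file: collect IDs to keep
        /^>/ {
            id = substr($1, 2)             # drop the leading ">"
            sub(/[ \t].*/, "", id)         # ID ends at the first whitespace
            emit = (id in keep)
        }
        emit                               # print header and sequence lines of kept records
    ' "$1" "$2" > "$3"
}
```

Running `filter_taxonomy` first and then `filter_fasta` with its output keeps the two files in sync. Keeping the exact commands you ran is an easy way to document the filtering step.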

chaibenl · January 16, 2020, 2:37pm

Thank you, Nicholas! That's very helpful.
