Trouble using UNITE and naive Bayes classifier

Hi,

I am hoping someone with more experience with the UNITE database can help me. I am working with qiime2-2017.8. I have downloaded and imported the 99% UNITE database into QIIME2, where I am trying to use it to assign taxonomy to my representative sequences.

As a first step, I use the ITS primer sequences to extract the region of interest and truncate to 250 bp. I then train the naive Bayes classifier (qiime feature-classifier fit-classifier-naive-bayes) on the truncated reads. But when I try to use the classifier (qiime feature-classifier classify-sklearn with my Deblur rep set and my newly built UNITE classifier) to assign taxonomy to fungal ITS sequences, every sequence is assigned to kingdom Fungi and a handful of sequences are also assigned to phylum Ascomycota, but no sequences are assigned to any taxon below phylum. When I train the classifier without truncating first, the UNITE database does a much better job.

I know the Werner paper mentioned on the features tutorial page is for 16S data. Was I incorrect in trying to truncate reads for ITS? If so, perhaps a note to this effect should be added to the training tutorial page.

Thanks for your input!

Hi @willowblade,
You are correct: in our experience, ITS sequences should not be truncated prior to classification. The issue is that the sequences in UNITE do not necessarily cover the “full ITS”, and some primer sites can even fall outside the amplified region of the sequences deposited in UNITE. I have raised an issue here to modify the tutorials as you recommend.
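To make the primer problem concrete, here is a minimal Python sketch of the check that primer-based extraction implicitly relies on: both primer sites must be present in the reference sequence. Everything in it is an assumption for illustration only. The primers and reference sequences are made-up toy strings (not real ITS primers or UNITE records), the function names are hypothetical, and it uses exact string matching, which is simpler than whatever matching QIIME 2 performs internally.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def has_both_primer_sites(ref, fwd_primer, rev_primer):
    """True if the forward primer occurs in ref and the reverse
    complement of the reverse primer occurs downstream of it."""
    start = ref.find(fwd_primer)
    if start == -1:
        return False
    rev_site = reverse_complement(rev_primer)
    return ref.find(rev_site, start + len(fwd_primer)) != -1


# Toy primers and references (hypothetical, for illustration only).
FWD = "ACGTAC"
REV = "GATTCG"  # its reverse complement is CGAATC

references = {
    # Reference that extends past both primer sites.
    "full_ref": "ACGTAC" + "TTTTTTTT" + "CGAATC",
    # Reference trimmed to the inner region: primer sites are gone.
    "trimmed_ref": "TTTTTTTT",
}

# References on which primer-based extraction would find nothing to anchor to.
missing = [name for name, seq in references.items()
           if not has_both_primer_sites(seq, FWD, REV)]
```

Here `missing` ends up containing only the trimmed reference, which is the situation described above: a reference that does not include the primer sites cannot be extracted reliably with those primers.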

This pre-print shows method optimization and comparative performance for q2-feature-classifier methods (and others), and gives recommendations for classifiers/parameter settings to use for ITS and 16S rRNA seqs.

One more extremely important tip: use the UNITE “developer” datasets (these are contained within the normal UNITE releases in a separate directory). The non-developer sequences come pre-trimmed to the ITS domain and, depending on which sequences you are using, they may be missing sections of the amplicon that will be present in your sequences.
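If you want to compare classifiers trained on different reference sets (truncated vs. untruncated, developer vs. non-developer), one rough way is to tally the deepest rank assigned to each feature. The sketch below is a minimal example under stated assumptions: it expects UNITE-style taxonomy strings such as `k__Fungi;p__Ascomycota`, treats an empty label or `unidentified` as unassigned at that rank, and uses hypothetical function names. It works on plain strings, so you would first need to export your classification results to text.

```python
from collections import Counter

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_rank(taxon):
    """Return the name of the deepest rank that carries a real label."""
    depth = 0
    for level in taxon.split(";")[:len(RANKS)]:
        level = level.strip()
        if "__" not in level:
            break  # e.g. a bare "Unassigned" string
        name = level.split("__", 1)[1]
        # Assumption: empty or "unidentified" labels count as unassigned.
        if not name or name.lower() == "unidentified":
            break
        depth += 1
    return RANKS[depth - 1] if depth else "unassigned"


def tally_depths(taxa):
    """Count how many taxonomy strings stop at each rank."""
    return Counter(deepest_rank(t) for t in taxa)


# Made-up taxonomy strings for illustration.
example = [
    "k__Fungi",
    "k__Fungi;p__Ascomycota",
    "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales",
    "Unassigned",
]
counts = tally_depths(example)
```

A classifier whose counts pile up at kingdom and phylum, as in the original post, is a sign that the reference reads do not match the amplicon well.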

Hope that helps!
