Interpreting Unassigned/k_Fungi ITS taxonomy generated using 3 reference datasets (UNITE & NCBI)

ardorbel · February 17, 2022, 6:50am

Hello,

I'm currently comparing taxonomy results for ITS1 sequences derived from sediment samples (amplified using ITS1F/ITS2). The results come from three different naive Bayes classifiers, each trained in QIIME 2 on one of the following reference datasets:

UNITE v 8.3 dynamic all eukaryotes (qiime release)
UNITE v 8.3 dynamic fungi (qiime release)
NCBI BioProject 177353 (Fungal ITS), trained using RESCRIPt

I've attached the resulting taxonomic comparison for your reference.
taxonomy-compare.txt (4.1 MB)

Although there is generally good consensus between these classifiers, one major difference I see is that the UNITE all-eukaryotes classifier returns "Unassigned" (with various confidence values) for many of my features, while the fungi-only UNITE and NCBI ITS classifiers return "k_Fungi" (confidence value = 1) for almost all of the same features. Is the "k_Fungi" kingdom classification to be trusted, or does the confidence value of 1 somehow indicate the opposite of 100% confidence, meaning these features should be considered "Unassigned", as the UNITE all-eukaryotes classifier reports?

I've scoured the forum, but was unable to find anyone with the same predicament. Before treating these features as poorly classified fungi, I want to make sure they are not actually co-amplified non-target sequences that have been misclassified. I'd appreciate your input on how I might best answer this question using QIIME 2.

Thanks!

Nicholas_Bokulich · February 17, 2022, 8:00am

No. If you train a classifier on only fungal sequences, then k__Fungi is the root and there is no outgroup for comparison. So basically any sequence that is not fungal but even remotely resembles any reference sequence (e.g., has some As, Cs, Gs, and Ts) could be classified at the root. Only vastly dissimilar sequences would be left unassigned in this case. This is a very good reason to include some non-fungal outgroup sequences in the database (maybe not all eukaryotes, but at least those expected in your samples).
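To make the reasoning above concrete, here is a rough sketch of how confidence-based truncation can produce "k__Fungi" with confidence 1. It is a simplified model written for illustration, not the actual classify-sklearn code: the function name `truncate_by_confidence`, the 0.7 threshold, and all of the probabilities are made up, and reporting the best kingdom's probability for "Unassigned" is just a choice of this sketch.

```python
from collections import defaultdict


def truncate_by_confidence(class_probs, threshold=0.7):
    """Walk down the ranks, summing class probabilities that share a prefix.

    class_probs maps full taxonomy strings (ranks joined by ';') to the
    probability a classifier assigned to that class for one query read.
    Returns (taxonomy, confidence). Hypothetical helper for illustration.
    """
    prefix = []        # ranks accepted so far
    confidence = None  # summed probability at the deepest accepted rank
    while True:
        depth = len(prefix) + 1
        totals = defaultdict(float)
        for taxonomy, prob in class_probs.items():
            ranks = taxonomy.split(";")
            # Only consider classes that extend the accepted prefix.
            if len(ranks) >= depth and ranks[:depth - 1] == prefix:
                totals[ranks[depth - 1]] += prob
        if not totals:
            break  # no deeper rank available
        name, conf = max(totals.items(), key=lambda kv: kv[1])
        if conf < threshold:
            if not prefix:
                return "Unassigned", conf  # even the top rank failed
            break
        prefix.append(name)
        confidence = conf
    return ";".join(prefix), confidence


# Made-up probabilities for one non-fungal read.
# Fungi-only reference: every class starts with k__Fungi, so the
# kingdom-level sum is 1 no matter how poorly the read matches.
fungi_only = {
    "k__Fungi;p__Ascomycota": 0.40,
    "k__Fungi;p__Basidiomycota": 0.35,
    "k__Fungi;p__Chytridiomycota": 0.25,
}

# Reference with outgroups: the same uncertainty is now split across
# kingdoms, so no kingdom reaches the threshold.
with_outgroups = {
    "k__Fungi;p__Ascomycota": 0.20,
    "k__Fungi;p__Basidiomycota": 0.15,
    "k__Viridiplantae;p__Streptophyta": 0.35,
    "k__Metazoa;p__Arthropoda": 0.30,
}

print(truncate_by_confidence(fungi_only))
print(truncate_by_confidence(with_outgroups))
```

In this toy model the confidence of 1 at the kingdom level only says that all of the probability fell somewhere inside the reference set, which is guaranteed when every reference class is fungal.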

Depending on the primers, co-amplification is very likely. Most ITS primers also amplify plants and most other eukaryotes, so I would guess that this is probable in sediments.

I still recommend spot-checking a few of these. You can use NCBI BLAST to see what they might be (non-target amplification or other junk?). As you have likely read on the forum, the classify-sklearn method assumes that all reads are in the same orientation. Mixed-orientation reads can therefore be poorly classified or left unassigned, because classification is performed in only one direction. You can use RESCRIPt to re-orient these reads, or filter out all reads that are not classified to at least phylum level and re-classify them with classify-sklearn in the reverse orientation.
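As a rough sketch of the filtering step, the snippet below picks out features that were not classified to at least phylum level, so they can be spot-checked with BLAST or re-classified in the other orientation. It assumes a tab-separated taxonomy table with "Feature ID", "Taxon", and "Confidence" columns and rank prefixes such as "p__"; the helper names and the toy table are hypothetical, so adjust them to match your own export.

```python
import csv
import io

# Complement table for plain A/C/G/T; other characters (e.g. N) pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def lacks_phylum(taxon):
    """True if the taxonomy string has no non-empty phylum (p__) rank."""
    ranks = [rank.strip() for rank in taxon.split(";")]
    return not any(r.startswith("p__") and len(r) > 3 for r in ranks)


def poorly_classified_ids(taxonomy_tsv):
    """Collect feature IDs not classified to at least phylum level.

    taxonomy_tsv is any open text file (or file-like object) holding the
    tab-separated taxonomy table, header row included.
    """
    reader = csv.DictReader(taxonomy_tsv, delimiter="\t")
    return [row["Feature ID"] for row in reader if lacks_phylum(row["Taxon"])]


# Toy table standing in for an exported taxonomy file (made-up values).
example_table = (
    "Feature ID\tTaxon\tConfidence\n"
    "f1\tk__Fungi\t1.0\n"
    "f2\tk__Fungi;p__Ascomycota\t0.98\n"
    "f3\tUnassigned\t0.45\n"
)

ids_to_check = poorly_classified_ids(io.StringIO(example_table))
print(ids_to_check)
```

The resulting ID list can be used to pull those sequences out for BLAST. `reverse_complement` is only there to show what flipping a read's orientation means; within QIIME 2 you would let RESCRIPt or classify-sklearn handle orientation rather than doing it by hand.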

Good luck!
