Issues with taxonomic classification using UNITE 9.0 and 10.0

Hello QIIME2 team :wave:

We wanted to share some issues we’ve been having with recent UNITE classifiers for fungal ITS. We found someone with a similar problem in a closed discussion, but unfortunately that information did not help us.

Our lab has been using UNITE to identify ITS fungal sequences in soil samples. We were previously using UNITE v. 8.3 and have tried to transition to newer versions as they’ve come out (v. 9.0 and 10.0). However, we’ve found that these newer versions leave a larger proportion of features unidentified past the kingdom level (labeled as “Incertae_sedis”) than v. 8.3 does.

To properly compare all database versions, we ran a test in qiime2-amplicon-2024.2. We trained three classifiers with the three most recent UNITE versions (8.3, 9.0, and 10.0; all dynamic, singletons set as RefS) using q2 feature-classifier fit-classifier-naive-bayes. Then, we used q2 feature-classifier classify-sklearn to classify features from 5 samples (using).

As you can see in our barplots at the phylum level, many features that were classified with UNITE 8.3 became unclassified Fungi (“p__Fungi_phy_Incertae_sedis”) in 9.0 and 10.0 (light purple bars).

This seems strange to us, as intuitively we’d expect newer classifier versions to identify a greater proportion of features than older ones. Also, when checking taxonomic IDs for individual features, many features that are marked as Incertae_sedis by the new classifiers (9.0 and 10.0) had a deeper classification with high confidence (>98%) in v8.3. There also isn’t any clear trend with regard to the features whose taxonomic IDs have been replaced with Incertae_sedis. With UNITE 8.3, these features comprised two phyla (Ascomycota and Mortierellomycota), three classes, five orders, nine families, and eleven genera.
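For anyone wanting to reproduce this per-feature comparison, here is a minimal sketch. It assumes both classifications were exported with `qiime tools export` to the usual `taxonomy.tsv` layout (a header row, then tab-separated Feature ID, Taxon, and Confidence columns). The function name `lost_phylum_features` and any file paths are hypothetical, not part of our pipeline.

```shell
#!/usr/bin/env bash
# Sketch: list features that had a named phylum in the older classification
# but are phylum-level Incertae_sedis in the newer one.
# Assumes exported taxonomy.tsv files: header row, then ID<TAB>Taxon<TAB>Confidence.
lost_phylum_features() {
  # $1 = older taxonomy.tsv, $2 = newer taxonomy.tsv
  awk -F'\t' '
    FNR == 1 { next }                              # skip the header row of each file
    FILENAME == ARGV[1] { old[$1] = $2; next }     # remember the older taxon per feature
    ($1 in old) &&
      $2 ~ /(^|;) *p__[^;]*Incertae_sedis/ &&      # newer phylum is Incertae_sedis
      old[$1] ~ /(^|;) *p__/ &&                    # older taxon reached phylum level
      old[$1] !~ /(^|;) *p__[^;]*Incertae_sedis/ { # and that phylum was a named one
        printf "%s\t%s\t%s\n", $1, old[$1], $2
      }
  ' "$1" "$2"
}
```

Calling the function with the 8.3 export first and the 10.0 export second prints one line per affected feature: its ID, the older taxon, and the newer taxon.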

We believe one possibility is that the newest UNITE versions (9.0 and 10.0) have more reference sequences than v. 8.3 (approximately 19K vs. 14K). With more reference sequences, these newer databases might also have more taxonomic IDs sharing the same ITS sequence, leading to more ambiguity. However, this appears to be a concern at the species level, but not at higher taxonomic levels.
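One rough way to probe this hypothesis is to count how many reference entries in each release already carry an Incertae_sedis label at the phylum level. The sketch below assumes the headerless two-column UNITE taxonomy file (ID, tab, taxonomy string) and labels of the form `p__Fungi_phy_Incertae_sedis`; the function name `count_phylum_incertae` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count reference entries whose phylum label contains "Incertae_sedis".
# Assumes a headerless TSV: reference ID <TAB> semicolon-separated taxonomy string.
# Prints: <entries with Incertae_sedis phylum> <TAB> <total entries>
count_phylum_incertae() {
  awk -F'\t' '
    { total++ }
    $2 ~ /(^|;) *p__[^;]*Incertae_sedis/ { hits++ }   # match only the p__ rank
    END { printf "%d\t%d\n", hits, total }
  ' "$1"
}
```

Running it on the taxonomy file of each release (for example `count_phylum_incertae sh_taxonomy_qiime_ver10_dynamic_04.04.2024.txt`) would show whether the share of such references actually grew between versions.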

Another possibility is that there is some sort of internal issue (with the classifier, code, qiime version, etc.) which is interfering with the functioning of the newer UNITE versions. We tried training the classifiers and running the analyses in two different QIIME versions (v2021.8 and v2024.2), but found no differences in the taxonomic IDs between them.

We also tried running the 99% classifier (instead of dynamic) for v10.0 and found slight decreases in the proportion of “phyla_Incertae_sedis” IDs, though they were still prevalent among the 5 samples.

Apologies if this was too wordy :sweat:, but we were trying to be as thorough as possible.

Hopefully you’ll have some insight on what could be happening here. Any help/input is much appreciated!

Thank you so much for your time and for providing this space :blush:.

Our code for one UNITE version, as an example:

# IMPORT UNITE ID TO TAXA FILE ===
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path sh_taxonomy_qiime_ver10_dynamic_04.04.2024.txt \
  --output-path id-to-taxa.qza

# IMPORT UNITE FASTA SEQUENCES ===
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path sh_refs_qiime_ver10_dynamic_04.04.2024.fasta \
  --output-path fasta-sequences.qza

# TRAINING CLASSIFIER ===
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads fasta-sequences.qza \
  --i-reference-taxonomy id-to-taxa.qza \
  --o-classifier its-unite-classifier_v10.0_2024-04-04_qiime2024.2.qza

# CLASSIFY SAMPLES ===
qiime feature-classifier classify-sklearn \
  --i-reads representative_sequences.qza \
  --i-classifier its-unite-classifier_v10.0_2024-04-04_qiime2024.2.qza \
  --p-n-jobs 15 \
  --verbose \
  --o-classification taxonomy.qza

Hello Mica,

Poor ITS classification with UNITE was also seen here, though we never found its cause. This may be related to input data, not the database.

Thank you for investigating this!

Hi @mica.tosi ,

Yes, if there are additional ambiguous sequences included this could explain the issue.

In a similar vein, we previously found (with an older UNITE release, maybe version 8.2) that removing unannotated/unidentified fungal sequences from the UNITE database improved classification accuracy:

But we did not remove the Incertae sedis sequences. I do not know off-hand whether the Incertae sedis labels come from manual curation by the UNITE curators or are the raw annotations given to the INSDC sequences. Indeed, because the placement of certain clades is uncertain, Incertae sedis is a valid label at certain levels... but sequences that are Incertae sedis from phylum through species level might just be garbage sequences from GenBank. It may be worth filtering these out of the database with q2-taxa and/or RESCRIPt prior to training your classifier, but I have not tested this so cannot be sure.
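As an untested illustration of that filtering idea, the sketch below drops phylum-level Incertae_sedis references from the raw UNITE files before `qiime tools import`. Note that it uses plain awk rather than q2-taxa or RESCRIPt, and it assumes the headerless taxonomy TSV plus FASTA headers whose first token is the reference ID. The function name `filter_phylum_incertae` and the output file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: remove references whose phylum is Incertae_sedis from the raw UNITE
# taxonomy and FASTA files, before importing them into QIIME 2.
# Assumptions: taxonomy is ID<TAB>taxonomy string (no header); each FASTA
# header starts with ">" followed by the same ID.
filter_phylum_incertae() {
  # $1 = taxonomy in, $2 = FASTA in, $3 = taxonomy out, $4 = FASTA out
  # Keep only taxonomy rows whose phylum label is not Incertae_sedis.
  awk -F'\t' '$2 !~ /(^|;) *p__[^;]*Incertae_sedis/' "$1" > "$3"

  # Keep only FASTA records whose ID survived the taxonomy filter.
  awk -F'\t' '
    FILENAME == ARGV[1] { keep[$1] = 1; next }   # IDs from the filtered taxonomy
    /^>/ {
      id = substr($1, 2)                         # drop the leading ">"
      sub(/[ \t].*/, "", id)                     # keep the first token only
      printing = (id in keep)
    }
    printing                                     # print header and sequence lines
  ' "$3" "$2" > "$4"
}
```

The two filtered files could then be imported and used to train a classifier exactly as in the original commands, which would let the effect of these references be compared directly against the unfiltered classifier.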
