Hello QIIME 2 team,
We wanted to share some issues we’ve been having with recent UNITE classifiers for fungal ITS. We found someone with a similar problem in a closed discussion, but unfortunately that information did not help us.
Our lab has been using UNITE to identify ITS fungal sequences in soil samples. We were previously using UNITE v. 8.3 and have tried to transition to newer versions as they’ve come out (v. 9.0 and 10.0). However, we’ve found that using these newer versions has resulted in larger proportions of features unidentified past the kingdom level (labeled as “Incertae_sedis”) than with v. 8.3.
To properly compare all database versions, we ran a test in qiime2-amplicon-2024.2. We trained three classifiers with the three most recent UNITE versions (8.3, 9.0, and 10.0; all dynamic, singletons set as RefS) using qiime feature-classifier fit-classifier-naive-bayes. Then, we used qiime feature-classifier classify-sklearn to classify features from 5 samples (using).
As you can see in our barplots at the phylum level, many features that were classified with UNITE 8.3 became unclassified Fungi (“p__Fungi_phy_Incertae_sedis”) in 9.0 and 10.0 (light purple bars).
This seems strange to us, as intuitively we’d expect a greater proportion of features to be identified with newer versions of the classifiers than with older ones. Also, when checking taxonomic IDs for individual features, many features that are marked as Incertae_sedis by the new classifiers (9.0 and 10.0) had a deeper classification with high confidence (>98%) in v. 8.3. There also isn’t any clear trend with regard to the features whose taxonomic IDs have been replaced with Incertae_sedis. With UNITE 8.3, these features comprised two phyla (Ascomycota and Mortierellomycota), three classes, five orders, nine families, and eleven genera.
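For anyone who wants to repeat the per-feature check, here is a minimal sketch of how such a comparison could be done. It assumes each taxonomy.qza has been exported with `qiime tools export` to a taxonomy.tsv (one header row, then tab-separated Feature ID, Taxon, Confidence). The function name and the paths in the comments are hypothetical.

```shell
# Sketch only: list features that had a phylum in the older classification
# but are phylum-level Incertae_sedis in the newer one.
# Assumes taxonomy.tsv files exported from taxonomy.qza, e.g.:
#   qiime tools export --input-path taxonomy.qza --output-path exported-v10
# The function name and paths are hypothetical.
lost_phylum_features() {
    # $1 = older taxonomy.tsv (e.g. v8.3), $2 = newer taxonomy.tsv (e.g. v10.0)
    awk -F'\t' '
        FNR == 1 { next }                   # skip the header row of each file
        NR == FNR { old[$1] = $2; next }    # first file: store taxon per feature ID
        ($1 in old) &&
            $2 ~ /p__Fungi_phy_Incertae_sedis/ &&
            old[$1] !~ /p__Fungi_phy_Incertae_sedis/ {
                print $1 "\t" old[$1] "\t" $2
        }
    ' "$1" "$2"
}
```

Each output line holds the feature ID, its older taxon, and its newer taxon, so the affected lineages can be tallied by eye or piped into further tools.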
We believe one possibility is that this is caused by the newest UNITE versions (9.0 and 10.0) having more reference sequences than v. 8.3 (approximately 19K vs. 14K). With more reference sequences, these newer databases might also have more taxonomic IDs sharing the same ITS sequence, leading to more ambiguity. However, this appears to be a concern at the species level, but not at higher taxonomic levels.
Another possibility is that there is some sort of internal issue (with the classifier, code, QIIME 2 version, etc.) which is interfering with the functioning of the newer UNITE versions. We tried training the classifiers and running the analyses in two different QIIME 2 versions (v2021.8 and v2024.2), but found no differences in the taxonomic IDs between them.
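One rough way to probe this idea would be to count how many distinct reference sequences appear under more than one taxonomy string in each release. The sketch below is only that, a sketch: it assumes the first whitespace-delimited token of each FASTA header matches the ID in the first column of the tab-separated UNITE taxonomy file, and the function name is hypothetical. It only catches exact duplicate sequences, so it is a narrow proxy for the ambiguity a classifier might face.

```shell
# Sketch only: count reference sequences that occur under more than one
# taxonomy string. Assumes FASTA header IDs match the first column of the
# tab-separated taxonomy file. The function name is hypothetical.
count_conflicting_sequences() {
    # $1 = taxonomy file (ID<TAB>taxonomy), $2 = reference FASTA
    awk '
        # Register the sequence collected so far under its taxonomy string.
        function record() {
            if (id == "" || !(id in tax)) return
            s = toupper(seq)                 # ignore upper/lower case differences
            key = s SUBSEP tax[id]
            if (!(key in seen)) { seen[key] = 1; ntax[s]++ }
        }
        NR == FNR { split($0, f, "\t"); tax[f[1]] = f[2]; next }   # taxonomy file
        /^>/ { record(); split(substr($0, 2), h, "[ \t]"); id = h[1]; seq = ""; next }
        { seq = seq $0 }                     # join multi-line sequences
        END {
            record()
            for (s in ntax) if (ntax[s] > 1) n++
            print n + 0
        }
    ' "$1" "$2"
}
```

Running it on the taxonomy and FASTA files of each UNITE release would give one number per release, which could then be compared across 8.3, 9.0, and 10.0.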
We also tried running the 99% classifier (instead of dynamic) for v10.0 and found slight decreases in the proportion of “phyla_Incertae_sedis” IDs, though they were still prevalent among the 5 samples.
Apologies if this was too wordy, but we were trying to be as thorough as possible.
Hopefully you’ll have some insight on what could be happening here. Any help/input is much appreciated!
Thank you so much for your time and for providing this space.
Our code for one UNITE version, as an example:
# IMPORT UNITE ID TO TAXA FILE ===
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path sh_taxonomy_qiime_ver10_dynamic_04.04.2024.txt \
  --output-path id-to-taxa.qza
# IMPORT UNITE FASTA SEQUENCES ===
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path sh_refs_qiime_ver10_dynamic_04.04.2024.fasta \
  --output-path fasta-sequences.qza
# TRAINING CLASSIFIER ===
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads fasta-sequences.qza \
  --i-reference-taxonomy id-to-taxa.qza \
  --o-classifier its-unite-classifier_v10.0_2024-04-04_qiime2024.2.qza
# CLASSIFY SAMPLES ===
qiime feature-classifier classify-sklearn \
  --i-reads representative_sequences.qza \
  --i-classifier its-unite-classifier_v10.0_2024-04-04_qiime2024.2.qza \
  --p-n-jobs 15 \
  --verbose \
  --o-classification taxonomy.qza
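To put a number on the phylum-level Incertae_sedis assignments for each classifier, something like the following could be run on the result of the classification step. This is a sketch under assumptions: it expects taxonomy.qza to have been exported to a taxonomy.tsv with `qiime tools export`, and the function name and output directory in the comments are hypothetical.

```shell
# Sketch only: report how many features are phylum-level Incertae_sedis.
# Assumes taxonomy.qza was exported first, e.g.:
#   qiime tools export --input-path taxonomy.qza --output-path exported-taxonomy
# which writes exported-taxonomy/taxonomy.tsv (header row, then tab-separated
# Feature ID, Taxon, Confidence). The function name is hypothetical.
phylum_incertae_summary() {
    # $1 = exported taxonomy.tsv
    # Prints: <features with p__Fungi_phy_Incertae_sedis> <total features>
    awk -F'\t' '
        NR == 1 { next }                                  # skip the header row
        { total++ }
        $2 ~ /p__Fungi_phy_Incertae_sedis/ { hits++ }
        END { print hits + 0, total + 0 }
    ' "$1"
}
```

Calling `phylum_incertae_summary exported-taxonomy/taxonomy.tsv` once per classifier would give feature counts that can be compared directly across UNITE versions. Note that these are counts of features, not the relative abundances shown in the barplots.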