Hi Qiime2 community,
I am new to qiime2 and ITS analysis. I am running qiime2-amplicon-2025.7 in conda.
Following recommendations on the qiime2 forum, I chose the developer version of the latest UNITE db versions (v10), containing all eukaryotes, including singletons, with dynamic clustering of SH.
I used RESCRIPt to filter and evaluate the UNITEdbs I chose; it worked great.
To check the prevalence of various suggested filter terms in my unfiltered database, I used grep.
There were many reports in the literature, on various GitHub issue trackers, and in the qiime2 forum about the prevalence of "unclassified", "undefined", "unknown", and "unassigned" taxonomic annotations in the UNITE db. For example, the UNITE db qiime2 resource itself indicates the following:
Missing information is indicated as "unidentified" item; “f__unidentified;” means that no family name for the sequence exists.
Surprisingly, Incertae_sedis was the only recommended filter term that occurred in my unfiltered db.
I performed controls to make sure I was using grep as intended:
--ignore-case (or -i) makes the match case-insensitive, so it should return both Fungi and fungi
--fixed-strings treats the quoted term as a literal string instead of a regex
--count returns the number of lines in the file that contain the quoted string
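To double-check my reading of these three options, here is a minimal sketch that runs them on a small made-up file. The three lines in it are invented for illustration and are not real UNITE entries; the variable names are also just placeholders.

```shell
# Toy taxonomy file: contents are hypothetical, NOT real UNITE lines.
tmp=$(mktemp)
printf 'SH1\tk__Fungi;p__Ascomycota\n'         >  "$tmp"
printf 'SH2\tk__fungi;p__Basidiomycota\n'      >> "$tmp"
printf 'SH3\tk__Viridiplantae;p__Anthophyta\n' >> "$tmp"

# grep exits with status 1 when nothing matches (it still prints 0),
# so "|| true" keeps a script running under "set -e".
case_sensitive=$(grep --count 'Fungi' "$tmp" || true)
case_insensitive=$(grep --count --ignore-case 'Fungi' "$tmp" || true)

# Without --fixed-strings, "." is a regex wildcard, so 'k..Fungi' matches 'k__Fungi'.
regex_hits=$(grep --count 'k..Fungi' "$tmp" || true)
# With --fixed-strings, the dots are literal, so nothing matches.
fixed_hits=$(grep --count --fixed-strings 'k..Fungi' "$tmp" || true)

echo "case-sensitive: $case_sensitive"
echo "case-insensitive: $case_insensitive"
echo "regex: $regex_hits"
echo "fixed string: $fixed_hits"
rm -f "$tmp"
```

For plain words such as 'unknown', which contain no regex metacharacters, --fixed-strings should make no difference to the count, so mixing the two styles in my commands below should not change the results.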
Sorry I didn't use identical grep search conditions below; this unfolded over several months and I'm summarizing here.
grep --count --ignore-case "Fungi" sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
168049
As expected, returned a lot of entries.
Negative control: expect to return zero entries.
grep --count --ignore-case "Dinosaur" sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
0
Now for the test: searching for filter terms related to unclassified taxa.
grep --fixed-strings --count -i 'unidentified' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
0
grep --fixed-strings --count -i 'unknown' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
0
grep --fixed-strings --count -i 'unclassified' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
0
grep --fixed-strings --count -i 'unassigned' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
0
grep --fixed-strings --count 'Incertae_sedis' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
94610
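For anyone who wants to repeat these checks in one pass, a loop along the following lines should work. This is only a sketch: it runs on a made-up two-line file so that it is self-contained, and `tax_file` and `report` are placeholder names. To use it on a real database, point `tax_file` at the downloaded taxonomy file instead.

```shell
# Stand-in taxonomy file with invented contents; replace with the real path.
tax_file=$(mktemp)
printf 'SH1\tk__Fungi;p__Ascomycota;c__Incertae_sedis\n'   >  "$tax_file"
printf 'SH2\tk__Fungi;p__Basidiomycota;c__Agaricomycetes\n' >> "$tax_file"

# Count matching lines for each candidate filter term (case-insensitive, literal match).
report=$(
  for term in unidentified unknown unclassified unassigned Incertae_sedis; do
    n=$(grep --fixed-strings --count --ignore-case "$term" "$tax_file" || true)
    printf '%s\t%s\n' "$term" "$n"
  done
)

printf '%s\n' "$report"
rm -f "$tax_file"
```

Using the same flags for every term removes the inconsistency in my commands and makes the counts directly comparable across database versions.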
I had read on a GitHub issue tracker that using the dev version reduced the incidence of "unidentified" classifications, so I checked the non-dev version of my UNITE db download, hoping to see these terms appear and confirm my understanding of the situation: no dice.
grep --fixed-strings --count -i 'fungi' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025.txt
168049
grep --fixed-strings --count -i 'unknown' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025.txt
0
grep --fixed-strings --count -i 'unclassified' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025.txt
0
grep --fixed-strings --count -i 'unidentified' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025.txt
0
grep --fixed-strings --count -i 'unassigned' sh_taxonomy_qiime_ver10_dynamic_s_all_19.02.2025_dev.txt
0
I also checked a version of the UNITE db that I downloaded via the qiime2 plug-in, with same results (not shown).
At this point I was afraid I was using grep and/or qiime incorrectly, and/or just misunderstanding what I had read (e.g. that UNITE dbs contain a lot of terms such as "unclassified", and we can filter them out to improve classification performance).
So, I looked on the qiime2 forum for one of the posts describing a large number of unclassified taxa.
This forum post described a different version of the UNITE db (v9): sh_refs_qiime_ver9_97_s_29.11.2022
So, I downloaded this database to see if I could detect these unclassifieds.
grep --fixed-strings --count -i 'Incertae' sh_taxonomy_qiime_ver9_97_s_29.11.2022.txt
74857
grep --fixed-strings --count -i 'unclassified' sh_taxonomy_qiime_ver9_97_s_29.11.2022.txt
0
grep --fixed-strings --count -i 'unassigned' sh_taxonomy_qiime_ver9_97_s_29.11.2022.txt
0
grep --fixed-strings --count -i 'unidentified' sh_taxonomy_qiime_ver9_97_s_29.11.2022.txt
0
grep --fixed-strings --count -i 'unknown' sh_taxonomy_qiime_ver9_97_s_29.11.2022.txt
0
So, that's 0/4 databases where I've been able to detect these widely reported filter terms. By far the most likely explanations are that I'm misunderstanding what I read about the presence of these terms in the UNITEdb, and/or I'm misusing grep. If so, I'd be grateful to anyone who can enlighten me.
More remote explanations include that these terms are more prevalent in some UNITEdbs than others: maybe earlier versions of UNITEdb? If anyone else detected these terms with grep or some other method in a particular UNITE db, I'd be glad to know.
Alternatively, the 'unclassified' etc. results that people report may not be part of the UNITEdb at all, but arise downstream of it, i.e. in the classifier output. For example, at one point my classifier was interpreting my reads as the wrong orientation, so I had 100% Unassigned. With the same database / sequences / classifier but specifying the opposite orientation, only two of 45 ASVs were Unassigned. Additionally, the person in the forum post above with "unclassifieds" was using nanopore reads, which I think would be consistent with the explanation that it's not the UNITEdb but rather the classifier results that are applying the term "unclassified".
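One way to test this idea would be to run the same grep on both the reference taxonomy and the exported classifier output, and compare. The sketch below uses two invented files. The three-column layout of the second file (Feature ID, Taxon, Confidence) is my assumption about what an exported taxonomy.tsv looks like, and all values are made up.

```shell
# Both files are hypothetical stand-ins, not real data.
ref_tax=$(mktemp)   # plays the role of the UNITE reference taxonomy
cls_tax=$(mktemp)   # plays the role of an exported classifier taxonomy.tsv (assumed layout)

printf 'SH1\tk__Fungi;p__Ascomycota\n'    >  "$ref_tax"
printf 'SH2\tk__Fungi;p__Basidiomycota\n' >> "$ref_tax"

printf 'Feature ID\tTaxon\tConfidence\n'    >  "$cls_tax"
printf 'asv1\tk__Fungi;p__Ascomycota\t0.98\n' >> "$cls_tax"
printf 'asv2\tUnassigned\t0.41\n'             >> "$cls_tax"

# Same literal search on both files; "|| true" absorbs grep's exit status 1 on zero matches.
ref_hits=$(grep --fixed-strings --count 'Unassigned' "$ref_tax" || true)
cls_hits=$(grep --fixed-strings --count 'Unassigned' "$cls_tax" || true)

echo "reference taxonomy: $ref_hits"
echo "classifier output: $cls_hits"
rm -f "$ref_tax" "$cls_tax"
```

If the term shows up only in the classifier output for real data as well, that would support the explanation that the label is applied during classification rather than stored in the database.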
Hopefully my question is clear: tl;dr why am I not detecting "unclassified" etc terms in unfiltered UNITE dbs via grep?
This is not time sensitive (I hope lol) as I'm satisfied with the results of my filtered UNITEdb and classifier, but I want to be on the same page as the community about this characterization of UNITEdb filter terms, and as a beginner, I'm nervous that if I don't pull on this loose thread now, I'll regret it later.
Thank you very much for any info or suggestions!