Happy New Year Qiime Community!
I'm new to qiime2 and to DADA2 and to analyzing ITS data.
I'm running qiime2-amplicon-2025.7 in a conda environment. Based on qiime2 forum recommendations, I'm using a UNITE database that includes all-eukaryotes and singletons and is the developer version: sh_refs_qiime_ver10_dynamic_s_all_19.02.2025_dev.fasta
I'm using mock fungal libraries to build and test an ITS1 Illumina amplicon sequence analysis pipeline with qiime2, with DADA2 to infer sample ASVs.
RESCRIPt has worked very well for me in filtering and evaluating a UNITE database, and evaluating the resulting naive-Bayes classifier.
When I used the qiime rescript dereplicate action to dereplicate my UNITE db sequence file, I selected the default option of --p-mode 'uniq', which retains multiple different taxonomic attributions to a given identical DNA sequence.
My question is: I seem to have "over-classification" of at least one of my measured ASVs from the mock libraries, and I'm not sure if this is likely due to UNITE database de-replication mode, my classifier, or some other explanation. FWIW, this pipeline passed QC for DADA2 denoising and sample inference, and also RESCRIPt evaluation for db and classifier, as far as my understanding of tutorials has indicated.
By "over-classification", I mean that this ASV is 143 bp (seq below), and is classified as k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Didymellaceae;g__Neoascochyta by my pipeline. When I did a BLAST search of this DNA sequence, it had 100% identity and 100% query coverage to at least three different annotations: Neoascochyta paspali, Ascochyta hordei, and Ustulina deusta.
So from the BLAST result, my understanding is that this sequence is too short to be informative past the family level. Even if this sequence is accurately attributed to these three species in my --p-mode 'uniq' UNITE db, my expectation is that the classifier would have essentially chosen the LCA of all of the different taxonomic annotations for those identical DNA sequences, and left the genus and species ranks (at least) empty. Instead, my classifier chose the Neoascochyta genus classification.
I'm concerned that I should have chosen the --p-mode 'lca' option instead of --p-mode 'uniq' when dereplicating my UNITE db, in order to avoid this "over-classification" (for example, creating false positives at deeper classification ranks for short sequences).
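To make concrete what I mean by "LCA" here, this is a minimal Python sketch of the idea: keep only the leading ranks that every annotation agrees on. It is not RESCRIPt's implementation, and `lca_taxonomy` is a hypothetical helper name. The Neoascochyta lineage is the one my classifier reported; the Ascochyta lineage is my assumption about how that genus would be annotated (same family), not something copied from my UNITE file.

```python
def lca_taxonomy(lineages, sep=";"):
    """Return the longest leading run of ranks shared by every lineage string."""
    split = [[rank.strip() for rank in lineage.split(sep)] for lineage in lineages]
    shared = []
    # zip(*split) walks the lineages rank by rank and stops at the shortest one.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the annotations disagree
        shared.append(ranks[0])
    return sep.join(shared)


# Lineage reported by my classifier for the ASV.
neoascochyta = (
    "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
    "f__Didymellaceae;g__Neoascochyta"
)
# ASSUMPTION: illustrative lineage for a second identical reference sequence.
ascochyta = (
    "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
    "f__Didymellaceae;g__Ascochyta"
)

print(lca_taxonomy([neoascochyta, ascochyta]))
```

With these two inputs the shared prefix stops at f__Didymellaceae, which is the depth I expected for this ASV. If an identical reference sequence carried an annotation from a different class or order, the shared prefix would be shorter still.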
Another possible explanation is that the naive-Bayes classifier takes priors into account when classifying. For example, if this DNA sequence A has LCA taxonomy at the family level, but there's another sequence B in the same sample that has the same LCA taxonomy as A, but is unambiguously known to be from g__Neoascochyta (maybe it's longer than seq A), then the classifier will decide that this genus is more likely than the other two genera for seq A.
That's an interesting idea and seems sensible from a statistics standpoint, though I'm less enthusiastic from a biological standpoint (for example, similar species would be more competitive and less likely to coexist in the same niche for some scenarios).
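Here is a toy Python sketch of the prior idea above, just Bayes' rule on three candidate genera. It is not the q2-feature-classifier code, and whether the real classifier's priors are set this way is an assumption on my part. The equal likelihoods stand in for three reference sequences that are identical to the ASV; the prior values and the genus-level labels are made up for illustration.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    joint = {taxon: likelihoods[taxon] * priors[taxon] for taxon in likelihoods}
    total = sum(joint.values())
    return {taxon: value / total for taxon, value in joint.items()}


# Identical reference sequences cannot be told apart by the sequence alone,
# so every candidate gets the same likelihood (illustrative value).
likelihoods = {"g__Neoascochyta": 1.0, "g__Ascochyta": 1.0, "g__Ustulina": 1.0}

# Uniform priors: the posterior is split evenly, so no genus stands out.
uniform = {taxon: 1 / 3 for taxon in likelihoods}

# ASSUMPTION: hypothetical skewed priors favouring one genus.
skewed = {"g__Neoascochyta": 0.8, "g__Ascochyta": 0.1, "g__Ustulina": 0.1}

for name, priors in [("uniform", uniform), ("skewed", skewed)]:
    post = posterior(likelihoods, priors)
    best = max(post, key=post.get)
    print(name, best, round(post[best], 3))
```

In this toy case the uniform priors leave each genus at about one third, which I would expect to fall below any reasonable confidence cutoff, while the skewed priors push one genus to 0.8 even though the sequence evidence is identical for all three.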
I'd be grateful for any thoughts or advice you might have!
Here's the DNA sequence of my "over-classified" ASV:
TTACCGAGAGTTGTAGGCTTCTGTCTACCATCTCTTACCCATGTCTTTTGCGTACTACACGTTTCCTCGGCAGGTCCGCCTGCCGCTAGGACAATTTAAACCATTTGCAGTTGCAGTCAGCGTCTGAAAAACTTAATAATTAC