"over-classification" of ASV taxonomy?

Happy New Year Qiime Community! :tada:

I'm new to qiime2 and to DADA2 and to analyzing ITS data. :baby: I'm running qiime2-amplicon-2025.7 in a conda environment. Based on qiime2 forum recommendations, I'm using a UNITE database that includes all-eukaryotes and singletons and is the developer version: sh_refs_qiime_ver10_dynamic_s_all_19.02.2025_dev.fasta :mushroom:

I'm using mock fungal libraries to build and test an ITS1 Illumina amplicon sequence analysis pipeline with qiime2, with DADA2 to infer sample ASVs.

RESCRIPt has worked very well for me in filtering and evaluating a UNITE database, and evaluating the resulting naive-Bayes classifier. :green_heart:

When I used the qiime rescript dereplicate action to dereplicate my UNITE db sequence file, I selected the default option of --p-mode 'uniq', which retains multiple different taxonomic attributions to a given identical DNA sequence.

My question is: I seem to have "over-classification" of at least one of my measured ASVs from the mock libraries, and I'm not sure if this is likely due to UNITE database de-replication mode, my classifier, or some other explanation. FWIW, this pipeline passed QC for DADA2 denoising and sample inference, and also RESCRIPt evaluation for db and classifier, as far as my understanding of tutorials has indicated.

By "over-classification", I mean that this ASV is 143 bp (seq below), and is classified as k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Didymellaceae;g__Neoascochyta by my pipeline. When I did a BLAST search of this DNA sequence, it had 100% identity and 100% query coverage to at least three different annotations: Neoascochyta paspali, Ascochyta hordei, and Ustulina deusta.

So from the BLAST result, my understanding is that this sequence is too uninformative (short) to have classification depth past the family level. Even if this sequence is accurately attributed to these three species in my --p-mode 'uniq' UNITE db, my expectation is that the classifier would have essentially chosen the LCA of all of the different taxonomic annotations for those identical DNA sequences, and left empty ranks at least at the genus and species level. Instead, my classifier chose the Neoascochyta genus classification.
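To make that expectation concrete, here is a minimal Python sketch of an LCA over taxonomy strings. It is only an illustration of the idea, not RESCRIPt's or the classifier's actual implementation; the function name `lca_taxonomy` is hypothetical, and the second lineage (Ascochyta placed in Didymellaceae) is included purely as an example input.

```python
def lca_taxonomy(lineages, sep=";"):
    """Return the longest rank prefix shared by all lineage strings."""
    split = [[rank.strip() for rank in lin.split(sep)] for lin in lineages]
    shared = []
    # Walk the ranks (kingdom -> species) of all lineages in parallel.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the annotations disagree
        shared.append(ranks[0])
    return sep.join(shared)


# Illustrative inputs: two annotations attached to the same sequence.
neo = ("k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
       "f__Didymellaceae;g__Neoascochyta")
asc = ("k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
       "f__Didymellaceae;g__Ascochyta")

print(lca_taxonomy([neo, asc]))
```

With these two inputs the shared prefix stops at f__Didymellaceae, which is the "empty genus rank" outcome I expected to see.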

I'm concerned that I should have chosen the --p-mode 'lca' option instead of --p-mode 'uniq' when filtering my UNITE db, in order to avoid this "over-classification" (creating false positives at deeper classification ranks of short sequences, for example).

Another possible explanation is that the naive-Bayes classifier takes priors into account when classifying. For example, if this DNA sequence A has LCA taxonomy at the family level, but there's another sequence B in the same sample that has the same LCA taxonomy as A, but is unambiguously known to be from g__Neoascochyta (maybe it's longer than seq A), then the classifier will decide that this genus is more likely than the other two genera for seq A.

That's an interesting idea and seems sensible from a statistics standpoint, though I'm less enthusiastic from a biological standpoint (for example, similar species would be more competitive and less likely to coexist in the same niche for some scenarios).

I'd be grateful for any thoughts or advice you might have! :folded_hands:

Here's the DNA sequence of my "over-classified" ASV:

TTACCGAGAGTTGTAGGCTTCTGTCTACCATCTCTTACCCATGTCTTTTGCGTACTACACGTTTCCTCGGCAGGTCCGCCTGCCGCTAGGACAATTTAAACCATTTGCAGTTGCAGTCAGCGTCTGAAAAACTTAATAATTAC


Hi @sibilant,

I would not jump to the conclusion of over-classification; inspect the reference database first.

Are all three of these in your reference database? (I am assuming you used the NCBI BLAST online tool rather than running blastn locally.) And are you blasting against the NCBI RefSeqs (more reliable) or the full nt database (less reliable, with lots of misclassifications)?

No, that is not necessary, because the classifiers in q2-feature-classifier do their own LCA in a sense. Keeping the 'uniq' information gives more flexibility downstream (e.g., for weighting species). But it's partly a matter of personal taste; LCA is just fine.
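A simplified sketch of that "LCA in a sense", in Python. The assumption here is that the classifier sums the per-taxon probabilities at each rank and stops descending once the best total falls below a confidence threshold (0.7 in this sketch, which as far as I know matches the default of classify-sklearn's `--p-confidence`). The function name `truncate_by_confidence` and the probabilities are hypothetical, and this is not the actual q2-feature-classifier code.

```python
from collections import defaultdict


def truncate_by_confidence(probs, threshold=0.7, sep=";"):
    """Keep the deepest lineage prefix whose summed probability >= threshold.

    probs maps full lineage strings to (hypothetical) posterior probabilities.
    """
    kept, kept_conf = [], 1.0
    depth = max(len(lin.split(sep)) for lin in probs)
    for level in range(1, depth + 1):
        totals = defaultdict(float)
        for lin, p in probs.items():
            ranks = lin.split(sep)
            # Only lineages that extend the prefix kept so far are candidates.
            if len(ranks) >= level and ranks[:level - 1] == kept:
                totals[ranks[level - 1]] += p
        if not totals:
            break
        name, conf = max(totals.items(), key=lambda kv: kv[1])
        if conf < threshold:
            break  # too ambiguous at this rank: leave it (and deeper) empty
        kept.append(name)
        kept_conf = conf
    return sep.join(kept), kept_conf


# Two genera in the reference share the sequence: genus is left unassigned.
both = {"f__Didymellaceae;g__Neoascochyta": 0.5,
        "f__Didymellaceae;g__Ascochyta": 0.5}
# Only one genus in the reference carries the sequence: genus is assigned.
one = {"f__Didymellaceae;g__Neoascochyta": 1.0}

print(truncate_by_confidence(both))
print(truncate_by_confidence(one))
```

The second case is why it matters which taxa are actually present in the reference: if only one genus carries the sequence, nothing competes with it at the genus rank.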

No, by default the naive Bayes classifiers use a uniform prior, so they assume that all species are equally likely to be observed. You can adjust class weights during model training. This will improve classification accuracy, though building a good prior can be challenging (we have never tried this with fungal data because good prior data did not exist when we designed that method, but for some environments like soil this would be possible). Weights would be a good way to address this problem, though, if you have prior data!
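A toy illustration of the effect of the prior, in Python. The taxa, likelihoods, and weights below are made-up numbers, and `posterior` is a hypothetical helper that applies Bayes' rule directly rather than calling the classifier. When two taxa fit a sequence equally well, the posterior simply mirrors whatever prior is supplied.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    unnorm = {taxon: likelihoods[taxon] * priors[taxon] for taxon in likelihoods}
    total = sum(unnorm.values())
    return {taxon: value / total for taxon, value in unnorm.items()}


# Identical sequences give identical likelihoods (hypothetical values).
same_likelihood = {"g__A": 0.2, "g__B": 0.2}

# Uniform prior: the two genera stay tied.
uniform = posterior(same_likelihood, {"g__A": 0.5, "g__B": 0.5})

# Weighted prior (e.g., g__A is far more common in this habitat): tie broken.
weighted = posterior(same_likelihood, {"g__A": 0.9, "g__B": 0.1})

print(uniform)
print(weighted)
```

With the uniform prior neither genus reaches a high confidence, while the weighted prior pushes most of the probability onto g__A.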


Thanks again @Nicholas_Bokulich!

You are correct that only Neoascochyta paspali appeared in my filtered UNITE db; Ascochyta hordei and Ustulina deusta did not (according to grep). So that explains why the classifier chose this taxon for the sequence.
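For anyone repeating the grep check on sequences rather than names, a plain grep can miss a match if the FASTA wraps sequences across lines or uses lowercase. Here is a small Python sketch that handles both; `fasta_records` and `find_hits` are hypothetical helper names, the records shown are toy data, and it only checks the forward strand.

```python
def fasta_records(lines):
    """Yield (header, sequence) pairs, joining sequences wrapped over lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_hits(lines, query):
    """Return headers of records containing the query (case-insensitive)."""
    q = query.upper()
    return [h for h, seq in fasta_records(lines) if q in seq.upper()]


# Toy in-memory FASTA; for a real file, pass an open file handle instead,
# e.g. find_hits(open("my_reference.fasta"), asv)  # hypothetical path
toy = [">ref1", "AAAC", "GTTT", ">ref2", "ccgg"]
print(find_hits(toy, "acgt"))
```

The headers returned can then be looked up in the matching taxonomy file to see which annotations the reference really contains for that sequence.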

My original BLAST search was against the core_nt NCBI database. I repeated the search (megablast) against NCBI's ITS database, restricted to the taxids for the three species above, and Neoascochyta paspali was by far the best match based on %identity and %alignment.

Thanks also for the explanation of naive-Bayes and the option of using weighted priors; helpful bonus.

It was clear in my mind, if not my original post, that the most likely explanation for this seeming discrepancy ("over-classification") was a misunderstanding in my own learning curve, and not the tools themselves. This example was an object lesson for me in not comparing apples to oranges: different taxonomic databases will often give different results.
Thanks very much again for your help! :qiime2:
