Hi all,
I'm analysing a small dataset of six samples with 113 ASVs and assigning taxonomy with the Naive Bayes classifier. I wanted to compare (A) the results obtained with default settings against (B) the results obtained when forcing the classifier to report an identification down to the finest taxonomic level (i.e., switching off the default confidence-based restriction of taxonomic depth with the "--p-confidence 0" parameter). These are the exact commands I'm using:
qiime feature-classifier classify-sklearn \
  --i-classifier V4-classifier.qza \
  --i-reads filtered-rep-seqs-uganda.qza \
  --o-classification taxonomy-uganda-restricted.qza

qiime feature-classifier classify-sklearn \
  --i-classifier V4-classifier.qza \
  --i-reads filtered-rep-seqs-uganda.qza \
  --p-confidence 0 \
  --o-classification taxonomy-uganda-non-restricted.qza
(I'm using QIIME 2 version 2022.8 in a conda environment under WSL2.)
It works well, but at first I was confused by getting different numbers of genera (11 vs 14 for the restricted vs unrestricted assignment), as I had expected identical genus counts, with only the taxonomic assignments themselves differing (please see the attached picture: restricted on the left, unrestricted in the middle, what I'd expected on the right). Then I realized the obvious: in the restricted assignment, more ASVs (and thus also genera found in the unrestricted analysis) are merged into incomplete assignments such as:
- k__Eukaryota;p__Bacillariophyta;c__Bacillariophyceae;;;__
- k__Eukaryota;p__Bacillariophyta;;;;
- k__Eukaryota;p__Bacillariophyta;c__Bacillariophyceae;o__Cocconeidales;; and so on
So here is my question. When I'm interested only in the number of genera, isn't it actually more meaningful to take it from the unrestricted assignment? I suspect there is a good chance that, even if the assigned taxon itself is incorrect, the number of genus-level distinctions (i.e., the number of genera) would be estimated more accurately (leaving aside general database-related issues) than in the default restricted analysis. What do you think? I'll be grateful for any and all ideas.
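For anyone who wants to reproduce the genus counts, here is a minimal sketch of how they could be pulled from an exported taxonomy table. It assumes the usual exported taxonomy.tsv layout (tab-separated, one header row, taxon string in the second column) and that genus sits at the sixth semicolon-separated rank, as in the lineages above. The function name count_genera and the output directory names are just placeholders for illustration.

```shell
# Count distinct genus-level labels in an exported taxonomy.tsv.
# Assumptions (not checked by the script): tab-separated file, one header row,
# taxon string in column 2, ranks separated by ';', genus at rank position 6.
count_genera() {
    awk -F'\t' 'NR > 1 {
        n = split($2, ranks, ";")
        if (n < 6) next                      # lineage stops above genus
        g = ranks[6]
        gsub(/^[ \t]+|[ \t]+$/, "", g)       # trim surrounding whitespace
        if (g == "" || g ~ /^[a-z]?__$/) next  # empty or prefix-only genus slot
        seen[g] = 1
    }
    END {
        c = 0
        for (k in seen) c++
        print c
    }' "$1"
}

# Hypothetical usage, after exporting each artifact:
#   qiime tools export --input-path taxonomy-uganda-restricted.qza \
#     --output-path exported-restricted
#   count_genera exported-restricted/taxonomy.tsv
```

Running it on both exports gives the two counts to compare. ASVs whose lineage stops above genus are simply skipped, which is exactly why the restricted run can report fewer genera.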