With the above commands, the taxonomic information was largely incomplete, and the majority of the reads were unclassified beyond the phylum level.
Then I read here that “fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites.” Therefore, I ran the following:
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads unite.qza --i-reference-taxonomy unite-taxonomy.qza --o-classifier classifier.qza
qiime feature-classifier classify-sklearn --i-reads merged_rep-seqs.qza --i-classifier database/classifier.qza --p-n-jobs 25 --o-classification New_Without_trimming_taxonomy.qza
With these commands, I observed better taxonomic classification.
However, the two outputs assign different taxonomy to some ASVs, e.g.:
Old classification output, where I used primer-based trimming:
New classification output, where I did not trim to the primers:
The taxonomy differs beyond the class level. So which classification is better, or more correct, for downstream analysis? Any suggestions?
Interesting. I do not think I have seen such disparate results between trimmed and untrimmed ITS classifiers before. This may be related to your primers: we have seen some rare cases where specific primer sets cause extract-reads to output a handful of unusually short reads, which befuddle the classifier.
So my hunch is that the untrimmed classifier may be more accurate in this case, but you can check this in two ways:
Prior knowledge: which taxa are more likely to be present in the samples you are studying?
Second opinion: try one of the other classifiers in q2-feature-classifier (on untrimmed reads), and/or use NCBI BLAST on a couple of these ASVs to see which appears to be the closest match (note: neither of these will necessarily be “correct”, but you can get a consensus prediction by comparing them).
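To decide which ASVs are worth a second opinion, it can help to list where the two assignments first disagree. The following is a minimal Python sketch, not part of QIIME 2: the function names, the ASV IDs, and the lineages are made up for illustration, and it assumes you have already loaded each exported taxonomy table into a dictionary mapping feature ID to a semicolon-delimited lineage string.

```python
# Rank names in the order they appear in a UNITE-style lineage string.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_lineage(lineage):
    """Split 'k__Fungi;p__Ascomycota;...' into a list of rank labels."""
    return [part.strip() for part in lineage.split(";") if part.strip()]


def first_divergent_rank(lineage_a, lineage_b):
    """Return the first rank at which two lineages differ, or None if identical.

    A rank that is missing in one lineage but present in the other counts
    as a difference (one classifier went deeper than the other).
    """
    a, b = split_lineage(lineage_a), split_lineage(lineage_b)
    for i, rank in enumerate(RANKS):
        label_a = a[i] if i < len(a) else None
        label_b = b[i] if i < len(b) else None
        if label_a != label_b:
            return rank
    return None


def compare_assignments(trimmed, untrimmed):
    """Map each shared feature ID to the first rank where its assignments differ."""
    report = {}
    for asv_id in sorted(trimmed.keys() & untrimmed.keys()):
        rank = first_divergent_rank(trimmed[asv_id], untrimmed[asv_id])
        if rank is not None:
            report[asv_id] = rank
    return report


# Hypothetical assignments, for illustration only.
trimmed = {
    "asv1": "k__Fungi;p__Ascomycota",
    "asv2": "k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales",
    "asv3": "k__Fungi;p__Ascomycota;c__Sordariomycetes",
}
untrimmed = {
    "asv1": "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales",
    "asv2": "k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Polyporales",
    "asv3": "k__Fungi;p__Ascomycota;c__Sordariomycetes",
}

report = compare_assignments(trimmed, untrimmed)
for asv_id, rank in report.items():
    print(f"{asv_id}: assignments diverge at {rank}")
```

ASVs where one classifier simply stops earlier (like the made-up asv1) are less worrying than ASVs where both classifiers name different taxa at the same rank (like asv2); the latter are the ones to BLAST or check against what you expect in your samples.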
NCBI BLAST will always provide deeper classification because it has no way to provide a consensus classification; it simply reports the top hits! And multiple different species can often be equally similar to your query sequence, especially when looking at short marker-gene amplicons. So it is not advisable to rely on NCBI BLAST alone.
QIIME 2 and pretty much all other marker-gene sequence classification methods out there (e.g., RDP, mothur) provide incomplete classification results because they are performing a consensus classification and/or determining the confidence at which a short sequence can be classified.
So BLAST results look better/more satisfying because you get species all the time… but that is actually a bad thing more often than not.
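The difference between taking the top hit and taking a consensus can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation used by q2-feature-classifier or any other tool: the function name, the 0.51 threshold, and the three hit lineages are assumptions chosen for the example.

```python
from collections import Counter


def consensus_lineage(hit_lineages, min_consensus=0.51):
    """Return the deepest lineage prefix supported by enough of the hits.

    At each rank, the most common label is kept only if at least
    `min_consensus` of all hits carry it (and agree on every rank above).
    """
    total = len(hit_lineages)
    remaining = [[p.strip() for p in h.split(";")] for h in hit_lineages]
    consensus = []
    depth = 0
    while remaining:
        labels = Counter(h[depth] for h in remaining if len(h) > depth)
        if not labels:
            break  # no hit is annotated this deep
        label, count = labels.most_common(1)[0]
        if count / total < min_consensus:
            break  # hits disagree too much at this rank, so stop here
        consensus.append(label)
        # Only hits that agree so far can support deeper ranks.
        remaining = [h for h in remaining if len(h) > depth and h[depth] == label]
        depth += 1
    return ";".join(consensus)


# Three hypothetical, equally similar hits for one short ITS amplicon.
genus_prefix = (
    "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;"
    "f__Nectriaceae;g__Fusarium"
)
hits = [
    genus_prefix + ";s__Fusarium_oxysporum",
    genus_prefix + ";s__Fusarium_solani",
    genus_prefix + ";s__Fusarium_graminearum",
]

top_hit = hits[0]                    # what a "top hit" report shows
consensus = consensus_lineage(hits)  # stops where the hits stop agreeing

print("top hit:  ", top_hit)
print("consensus:", consensus)
```

Here the top hit names a species, but all three species match equally well, so the consensus stops at genus. The shallower answer is the one the data actually support.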
q2-feature-classifier is actually finding a species-level match, but it is listed as “unidentified” in the reference database.