Training the classifier for ITS with or without Primer trimming


I have been training a classifier for ITS2 sequencing analysis. First, I extracted the reference reads with my primers and trained the classifier on the trimmed sequences:

qiime feature-classifier extract-reads --i-sequences unite_dyn_seqs.qza --p-f-primer GTGARTCATCGARTCTTTG --p-r-primer TTCCTSCGCTTATTGATATGC --o-reads UNITE_DB_Trimmed.seqs.qza
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads UNITE_DB_Trimmed.seqs.qza --i-reference-taxonomy unite_dyn_tax.qza --o-classifier UNITE_DB_classifier.qza
qiime feature-classifier classify-sklearn --i-reads merged_rep-seqs.qza --i-classifier UNITE_DB_classifier.qza --p-n-jobs 10 --o-classification New_taxonomy.qza

With the above commands, the taxonomic information was very incomplete, and the majority of the reads were unclassified beyond phylum level.

Then I read here that “fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites.” Therefore, I trained the classifier on the untrimmed reference instead:
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads unite.qza --i-reference-taxonomy unite-taxonomy.qza --o-classifier classifier.qza
qiime feature-classifier classify-sklearn --i-reads merged_rep-seqs.qza --i-classifier database/classifier.qza --p-n-jobs 25 --o-classification New_Without_trimming_taxonomy.qza

With these commands, I observed better taxonomic classification.

However, the two outputs give different classifications for some ASVs, e.g.:

Old classification output, where I used primer-based trimming:
f6d792e7f1cb801fea8bd66e1982f0b7: k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Helotiales_fam_Incertae_sedis;g__Coleophoma
02dc27af4c76269e29b031e7b8cbe08a: k__Fungi;p__Ascomycota;c__Lecanoromycetes

New classification output, where I did not trim to the primers:
f6d792e7f1cb801fea8bd66e1982f0b7: k__Fungi;p__Ascomycota;c__Arthoniomycetes;o__Lichenostigmatales;f__Phaeococcomycetaceae;g__Phaeococcomyces;s__unidentified
02dc27af4c76269e29b031e7b8cbe08a: k__Fungi;p__Ascomycota;c__Dothideomycetes

The two outputs disagree from class level downward. So, which classification is better or more correct for downstream analysis? Any suggestions?



Hi @shashankgpt,
Interesting. I do not think that I have seen such disparate results between trimmed and untrimmed ITS classifiers. This may be related to your primers. We have seen some rare cases where specific primer sets lead extract-reads to output a handful of unusually short reads, which befuddle the classifier.
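
If you want to check for such short reads, here is a minimal sketch. It assumes you have exported the trimmed reference to a FASTA file (for example with `qiime tools export`); the function names and the 100 bp cutoff are my own illustrative choices, not part of QIIME 2.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first token of the header as the identifier.
            header = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_reads(lines: Iterable[str], min_length: int = 100) -> Dict[str, int]:
    """Map identifier -> length for every sequence shorter than min_length.

    The default of 100 is an arbitrary illustrative cutoff.
    """
    return {
        seq_id: len(seq)
        for seq_id, seq in read_fasta(lines)
        if len(seq) < min_length
    }


if __name__ == "__main__":
    # Tiny made-up records, only to show the call pattern.
    demo = [">ref1", "ACGTACGT", "ACGT", ">ref2", "ACG"]
    print(short_reads(demo, min_length=5))
```

To run it on real data, pass an open file handle instead of the demo list, e.g. `short_reads(open("exported/dna-sequences.fasta"))`, where `exported/` is a hypothetical output directory. If only a few references come back very short, that would be consistent with the explanation above.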

So my hunch is that the untrimmed classifier may be more accurate in this case, but you can verify with two things:

  1. knowledge: which taxa are more likely to be present in the samples you are studying?
  2. second opinion: try one of the other classifiers in q2-feature-classifier (on untrimmed reads), and/or use NCBI BLAST on a couple of these ASVs to see which appears to be the closest match (note: neither of these will necessarily be “correct” but you can get a consensus prediction by looking at these).
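
When comparing the outputs of two classifiers across many ASVs, it can help to tabulate the rank at which each pair of assignments first disagrees. A rough sketch, assuming UNITE-style semicolon-separated lineages that start at kingdom; the helper names are hypothetical:

```python
from typing import List, Optional

# Rank names in the order they appear in a UNITE-style lineage string.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_taxonomy(taxon: str) -> List[str]:
    """Split 'k__Fungi;p__Ascomycota;...' into a list of rank labels."""
    return [part.strip() for part in taxon.split(";") if part.strip()]


def first_divergence(a: str, b: str) -> Optional[str]:
    """Return the first rank at which two lineages conflict.

    Returns None when the lineages agree over every rank they share,
    i.e. one is simply a shallower version of the other.
    """
    for rank, x, y in zip(RANKS, split_taxonomy(a), split_taxonomy(b)):
        if x != y:
            return rank
    return None


if __name__ == "__main__":
    # The two assignments for ASV 02dc27af... quoted earlier in this thread.
    trimmed = "k__Fungi;p__Ascomycota;c__Lecanoromycetes"
    untrimmed = "k__Fungi;p__Ascomycota;c__Dothideomycetes"
    print(first_divergence(trimmed, untrimmed))
```

ASVs that come back as `None` only differ in classification depth, while those that diverge at a high rank (as here) are the ones worth checking by hand.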

Please let us know what you find!

Hi @Nicholas_Bokulich

I am looking at the dust microbiome

I performed NCBI BLAST, and it turns out that the classifier trained on untrimmed reads gives the correct taxonomic classification; the BLAST hits have 100% query cover and over 97% identity.

The closest NCBI BLAST matches for these ASVs:
f6d792e7f1cb801fea8bd66e1982f0b7 - Ascomycota; Pezizomycotina; Arthoniomycetes; Lichenostigmatales; Phaeococcomycetaceae;

02dc27af4c76269e29b031e7b8cbe08a - Ascomycota; Pezizomycotina; Dothideomycetes; Pleosporomycetidae; Pleosporales; Pleosporineae; Phaeosphaeriaceae; Dematiopleospora.

Finally, it looks like NCBI BLAST also provides deeper classification.

Thanks for confirming!

NCBI BLAST will always provide deeper classification because it has no way to provide a consensus classification — it will always report the top hits! And multiple different species can often be equally similar to your query sequence, especially when looking at short marker-gene amplicons. So it is not advisable to rely on NCBI BLAST.

QIIME 2 and pretty much all other marker-gene sequence classification methods out there (e.g., RDP, mothur) provide incomplete classification results because they are performing a consensus classification and/or determining the confidence at which a short sequence can be classified.

So BLAST results look better/more satisfying because you get species all the time… but that is actually a bad thing more often than not.
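
To illustrate the difference between reporting a top hit and reporting a consensus, here is a simplified sketch. It is not the exact algorithm used by q2-feature-classifier or any other tool; the function name, the 0.51 threshold, and the placeholder genus labels are assumptions for illustration only.

```python
from collections import Counter
from typing import List


def consensus_lineage(hits: List[str], min_consensus: float = 0.51) -> str:
    """Return the deepest lineage supported by enough of the hits.

    At each rank, the most common label is kept only if the fraction of
    all hits carrying it (and agreeing with the consensus so far) is at
    least min_consensus; otherwise the lineage is truncated there.
    """
    lineages = [
        [rank.strip() for rank in hit.split(";") if rank.strip()]
        for hit in hits
    ]
    total = len(lineages)
    consensus: List[str] = []
    depth = 0
    while True:
        labels = Counter(
            lineage[depth]
            for lineage in lineages
            if len(lineage) > depth and lineage[:depth] == consensus
        )
        if not labels:
            break  # no hit is annotated this deeply
        label, count = labels.most_common(1)[0]
        if count / total < min_consensus:
            break  # hits disagree too much at this rank
        consensus.append(label)
        depth += 1
    return ";".join(consensus)


if __name__ == "__main__":
    # Three hypothetical, equally good hits that differ only at genus level.
    hits = [
        "k__Fungi;p__Ascomycota;c__Dothideomycetes;g__GenusA",
        "k__Fungi;p__Ascomycota;c__Dothideomycetes;g__GenusB",
        "k__Fungi;p__Ascomycota;c__Dothideomycetes;g__GenusC",
    ]
    print(hits[0])                  # what a top-hit report would show
    print(consensus_lineage(hits))  # stops at class: genera disagree
```

The top hit names a genus even though the three hits are tied, whereas the consensus stops at the last rank they agree on. That truncation is the "incomplete" classification described above.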

q2-feature-classifier is actually finding a species-level match, but it is listed as “unidentified” in the reference database.