Classifier trained on NCBI RefSeq 16S rRNA data gives weird results

mkcheung · August 12, 2022, 8:34am

Thank you again for your reply.
The second bar plot in my first post was generated using the SILVA full-length classifier downloaded from the Data Resources page (for a different QIIME 2 version, 2020.11 in my case). That result looks reasonable.

I've also tried several other options:

- Use classify-consensus-vsearch on ncbi-refseqs.qza and ncbi-refseqs-taxonomy.qza generated following this tutorial, instead of classify-sklearn on ncbi-refseqs-classifier.qza generated following the same tutorial. The result looks normal. So it seems that the sequence and taxonomy files extracted from NCBI RefSeq using RESCRIPt are fine and the issue is caused by the classifier.
- Change --p-min-lens in the filter-seqs-length-by-taxon step (for Bacteria) from 1200 to 1400 to raise the minimum length of the bacterial reference sequences kept in ncbi-refseqs.qza, and then use the new ncbi-refseqs-classifier.qza built from it to classify my reads. The same weird results remain.
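
To put a number on how far the classify-sklearn and classify-consensus-vsearch results differ, the two taxonomy tables could be compared feature by feature. This is only a rough sketch: it assumes both classifications have been exported to tab-separated files with "Feature ID" and "Taxon" columns and semicolon-separated ranks. The function names and the tiny inline tables are made up for illustration and are not my real data.

```python
import csv
import io


def load_taxonomy(handle):
    """Read a tab-separated taxonomy table into {feature_id: [rank labels]}."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        # Split "k__X; p__Y; ..." into a clean list of rank labels.
        ranks = [r.strip() for r in row["Taxon"].split(";") if r.strip()]
        table[row["Feature ID"]] = ranks
    return table


def disagreements(tax_a, tax_b, depth):
    """Return IDs of shared features whose first `depth` ranks differ."""
    shared = sorted(set(tax_a) & set(tax_b))
    return [fid for fid in shared if tax_a[fid][:depth] != tax_b[fid][:depth]]


# Made-up example tables (assumed export layout, not real results).
sklearn_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv1\tk__Bacteria; p__Firmicutes; c__Bacilli\t0.99\n"
    "asv2\tk__Bacteria; p__Proteobacteria\t0.71\n"
)
vsearch_tsv = (
    "Feature ID\tTaxon\tConsensus\n"
    "asv1\tk__Bacteria; p__Firmicutes; c__Bacilli\t1.0\n"
    "asv2\tk__Bacteria; p__Bacteroidetes\t1.0\n"
)

a = load_taxonomy(io.StringIO(sklearn_tsv))
b = load_taxonomy(io.StringIO(vsearch_tsv))

# Features that disagree at phylum level (first two ranks).
print(disagreements(a, b, depth=2))
```

With real exports, the io.StringIO objects would be replaced by open file handles, and depth can be varied to see at which rank the two methods start to diverge.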

By the way, the primers (27F and 1492R) have been removed from my sequences following the DADA2 protocol for PacBio data, which also includes a read orientation step.