Classifier trained on NCBI RefSeq 16S rRNA data gives weird results

Hello guys, this is my first post in the forum.
I have been a great fan of QIIME2 and its plugins.

I followed this tutorial exactly to download the NCBI RefSeq 16S rRNA data and train a classifier on it. The resulting classifier looks fine according to the evaluation and predicted-taxonomy outputs. However, when I used it to classify my own data, which are 16S rRNA sequences spanning the V1-V9 region sequenced using PacBio, it gave me some weird results (see below).

The sequences themselves shouldn't be the culprit because the same dataset classified using a classifier trained on SILVA full-length 16S rRNA sequences gave me reasonable results (see below).

Any idea what has gone wrong here? Thanks!

Hi @mkcheung,

I often see results like this when the sequences are not in the same orientation as the reference database. This has been encountered before here and here.

Perhaps try running qiime rescript orient-seqs ... on your sequences. The scikit-learn classifier tries to handle mixed orientations itself (see one of the threads above), but this may not always work. Also keep in mind that NCBI RefSeq is not as expansive as SILVA, so some issues, like failing to detect the correct sequence orientation, may be more pronounced with it.
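
To illustrate what an orientation mismatch looks like, here is a minimal Python sketch that guesses whether a read matches a reference better as-is or as its reverse complement, by counting shared k-mers. This is only an illustration under my own assumptions. It is not how orient-seqs or the scikit-learn classifier decide orientation, and the function names and the default k=8 are made up for the example.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(query, reference, k=8):
    """Guess the orientation of query relative to reference.

    Counts k-mers shared with the reference for the query as given and
    for its reverse complement, then reports whichever shares more.
    """
    ref_kmers = kmers(reference, k)
    forward = len(kmers(query, k) & ref_kmers)
    reverse = len(kmers(reverse_complement(query), k) & ref_kmers)
    if forward == reverse:
        return "undetermined"
    return "forward" if forward > reverse else "reverse"
```

A read that comes back as "reverse" against most reference sequences would need to be reverse-complemented before classification. With a smaller reference set, fewer reads share enough k-mers with any reference for a check like this to be decisive.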

Hi Mike, thank you very much for your reply.
I followed your suggestion and ran orient-seqs on my input sequences.
However, the same issue persists.
Any further thoughts are appreciated :man_bowing: