Hello guys, this is my first post in the forum.
I have been a great fan of QIIME2 and its plugins.
I followed exactly this tutorial to download and train a classifier based on the NCBI RefSeq 16S rRNA data. The resulting classifier looks fine according to the evaluation and predicted-taxonomy outputs. However, when I used this classifier to classify my own data, which are 16S rRNA sequences spanning the V1-V9 region sequenced using PacBio, it gave me some weird results (see below).
I often see results like this when the sequences are not in the same orientation as the reference database. This has been encountered before here and here.
Perhaps try running qiime rescript orient-seqs ... on your sequences. The scikit learn classifier tries to handle this (see one of the threads above), but may not always work. Also keep in mind that the NCBI RefSeq is not as expansive as SILVA, so some issues, like failing to detect the correct sequence orientation, may be exaggerated.
@Nicholas_Bokulich, reminded me of another potential reason for what you are observing. That is, the primers for the V1V9 region may simply not be present in many of the reference sequences, which basically means that the PCR primer search and extraction using qiime feature-classifier extract-reads may not work well. I'd expect that the V1V9 region is almost equivalent to the "full-length" classifier we provide. Perhaps sanity check that the full-length classifier provides reasonable results, and let us know what you find.
If you know the alignment positions within the SILVA alignment, you could simply download and import the one of the aligned SILVA FASTA files from here, and extract with the alignment positions themselves, with the qiime rescript trim-alignment --p-position-start xxx --p-position-end yyy ... flags. Then you can use qiime rescript degap-seqs ... to remove the alignment gaps, then proceed as you normally would with the remainder of the SILVA tutorial.
Try out our alpha / beta release of our RESCRIPt extract-seq-segments command. In fact, this command was made with the thought that primer sequences may not be present within your reference reads.
Thank you for your reply again.
Actually the second bar plot in my 1st post was generated using the SILVA full-length classifier downloaded from the Data Resources page (just for a different QIIME2 version: 2020.11 in my case). The result looks reasonable.
Actually I've tried several other options:
use classify-consensus-vsearch on ncbi-refseqs.qza and ncbi-refseqs-taxonomy.qza generated following this tutorial, instead of classify-sklearn on ncbi-refseqs-classifier.qza generated following the same link. The result looks normal. So it seems like that the sequence and taxonomy files extracted from NCBI RefSeq using RESCRIPt are fine and the issue is caused by the classifier.
change --p-min-lens in the filter-seqs-length-by-taxon step (for Bacteria) from 1200 to 1400 to increase the read length in the ncbi-refseqs.qza generated, and then use the new ncbi-refseqs-classifier.qza to classify my reads: the same weird results remain.
By the way, the primers (27F and 1492R) have been removed from my sequences following the DADA2 for PacBio data protocol, which has also included a read orientation step.
So eventually I changed the --p-read-orientation option of classify-sklearn from 'auto' (the default) to 'same' and then the classifier worked nicely! So in the end the issue is still the read orientation as you initially suggested...
Thank you for your help!