Unexpected taxonomic assignment results from two classifiers trained on nearly identical sequences

Hi @mammerlin,

I got the same taxonomic classifications as you did when using the older SILVA 138 database. But when I used SILVA v138.2, processed the same way as 138 (i.e. cull-seqs, filter-seqs-length-by-taxon, then dereplicate), I obtained valid hits, as before.

CCS-silva-138.2-classified.qzv (1.3 MB)

Note: there is only one minor difference in the provenance between how I processed the database and what is provided on the data resources page. I manually ran reverse-transcribe after pulling the database and then ran cull-seqs. I did not need to do that, as cull-seqs will do the reverse transcription for you, but I was trying out a few other things at the time. Other than that, the curation is identical.
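For reference, the curation order described above can be sketched roughly as follows. The artifact file names are placeholders of my own, and the `--p-labels` / `--p-min-lens` values are the ones I recall from the RESCRIPt SILVA tutorial, not settings confirmed for this thread, so treat them as assumptions. The small `run` helper is also just for illustration: it prints each command and only executes it when `DRY_RUN=0` is set.

```shell
# Dry-run helper (illustrative): print each command, execute only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@" || exit 1; fi
}

# All file names below are placeholders, not artifacts from this thread.
curate_silva() {
    # Drop sequences with excess ambiguous bases / long homopolymers.
    # cull-seqs also converts the RNA sequences to DNA.
    run qiime rescript cull-seqs \
        --i-sequences silva-138.2-rna-seqs.qza \
        --o-clean-sequences silva-138.2-culled.qza

    # Remove short sequences, with a minimum length per domain (assumed values).
    run qiime rescript filter-seqs-length-by-taxon \
        --i-sequences silva-138.2-culled.qza \
        --i-taxonomy silva-138.2-tax.qza \
        --p-labels Archaea Bacteria Eukaryota \
        --p-min-lens 900 1200 1400 \
        --o-filtered-seqs silva-138.2-filtered.qza \
        --o-discarded-seqs silva-138.2-discarded.qza

    # Collapse identical sequences that carry identical taxonomy.
    run qiime rescript dereplicate \
        --i-sequences silva-138.2-filtered.qza \
        --i-taxa silva-138.2-tax.qza \
        --p-mode uniq \
        --o-dereplicated-sequences silva-138.2-derep-seqs.qza \
        --o-dereplicated-taxa silva-138.2-derep-tax.qza
}

curate_silva
```

Run it once as is to review the commands, then again with `DRY_RUN=0` once the file names match your own artifacts.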

I also trimmed your reads like so for the 138 and 138.2:

qiime feature-classifier extract-reads \
    --i-sequences CCS.qza \
    --p-f-primer CCTACGGGNGGCWGCAG \
    --p-r-primer CGGTGTGTACAAGGCCCGGGAACG \
    --o-reads CCS-V3V8-trimmed.qza

and classified them:

qiime feature-classifier classify-sklearn \
    --i-classifier silva-138.2-classifier.qza \
    --i-reads CCS-V3V8-trimmed.qza \
    --o-classification CCS-V3V8-trimmed-classified.qza

CCS-V3V8-trimmed-classified.qzv (1.3 MB)

Again, with the 138.0 database the classification was erroneous, but with 138.2 it seemed sane. This tells me that what we see comes from a combination of two things: which reference reads are available in the reference database (these are used to assess orientation) and how those reference reads are curated.

I find it interesting that skipping the cull-seqs and filter-seqs-length-by-taxon steps seemed to give a good classification with 138.0, whereas applying them gave erroneous classifications for your PacBio data. I've not observed this before, but then again every data set is different. This does not appear to be the case for 138.2 (I assume 138.1 will behave identically, as the only difference between 138.1 and 138.2 is the taxon labels).

This is why we stress that users should be cognizant of their data and what form it is in: for example, the read direction of your data and of the reference database, and the data type (Illumina, PacBio, etc.), among other things. As my observations above imply, database curation is sometimes not trivial. As with any analysis, decisions made during the curation process can alter taxonomic classification results.
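If you are unsure about read direction, one rough check is to count how many of your sequences contain the forward primer site as written versus its reverse complement. The sketch below is only an illustration under a few assumptions: the helper names (`revcomp`, `iupac_to_regex`, `count_hits`) are my own, the FASTA file name is hypothetical, and the FASTA is assumed to hold uppercase sequences on a single line each (for example, ASVs exported from a .qza).

```shell
# Reverse-complement a DNA string, keeping IUPAC ambiguity codes valid
# (S, W and N are their own complements, so they are left untouched).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHV' 'TGCAYRMKVHDB'
}

# Expand IUPAC ambiguity codes into character classes for grep -E.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Count sequences in a single-line FASTA ($2) that contain the primer site ($1).
count_hits() {
    grep -v '^>' "$2" | grep -c -E "$(iupac_to_regex "$1")" || true
}

fwd=CCTACGGGNGGCWGCAG                    # forward primer used in this thread
fasta=${FASTA:-dna-sequences.fasta}      # hypothetical export of your ASVs
if [ -f "$fasta" ]; then
    echo "forward primer as written:      $(count_hits "$fwd" "$fasta")"
    echo "forward primer, rev-complement: $(count_hits "$(revcomp "$fwd")" "$fasta")"
fi
```

If nearly all hits fall in the first count, the reads are probably in the same orientation as the primer. A sizeable second count suggests mixed orientation. This only looks for exact matches to the primer site, so treat the numbers as a hint, not a definitive answer.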

Again, if you know the data being generated are going to be in the 5'-3' direction then it is good practice to be explicit and set --p-read-orientation same, from a reproducibility standpoint anyway. I often do this myself.

Otherwise, if your PacBio output contains reads in mixed orientation, you can try running rescript orient-seqs on your ASVs and then set --p-read-orientation same during classification. This should orient your reads to match the reference database, and you should be good to go. But again, not everything is perfect, and there may be orientation issues there too. I do not have much experience with PacBio data, so you may need to experiment a little.
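A minimal sketch of that orient-then-classify route might look like the following. The input and output file names are placeholders I made up, so substitute your own ASVs, reference sequences, and classifier. As a precaution, the illustrative `run` helper only prints the commands unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (illustrative): print each command, execute only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@" || exit 1; fi
}

# All file names below are placeholders, not artifacts from this thread.
orient_and_classify() {
    # Orient the ASVs against the reference sequences; anything that does
    # not match the reference is written to the unmatched output.
    run qiime rescript orient-seqs \
        --i-sequences CCS-rep-seqs.qza \
        --i-reference-sequences silva-138.2-derep-seqs.qza \
        --o-oriented-seqs CCS-rep-seqs-oriented.qza \
        --o-unmatched-seqs CCS-rep-seqs-unmatched.qza

    # Classify the oriented reads, stating the orientation explicitly.
    run qiime feature-classifier classify-sklearn \
        --i-classifier silva-138.2-classifier.qza \
        --i-reads CCS-rep-seqs-oriented.qza \
        --p-read-orientation same \
        --o-classification CCS-oriented-classified.qza
}

orient_and_classify
```

It is worth checking how many sequences end up in the unmatched output before trusting the classification, since a large number there would suggest a problem upstream.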

My recommendation would be to keep things simple, use the latest SILVA 138.2 database, and set --p-read-orientation same.
