Sklearn-classifier bug for archaea 16S amplicons

Hi everyone,
I used QIIME 2 (2019.7) for taxonomic classification and ran into a problem classifying my archaea-targeted 16S rRNA sequences with the feature-classifier classify-sklearn plugin. All the representative reads generated by VSEARCH were classified as Bacteria, while the taxonomic classification of the representative reads generated by Deblur seemed normal.

So I tested the sklearn classifier with 15 archaeal 16S rRNA reads taken from the VSEARCH representative reads. Something was clearly wrong: all 15 reads were classified as Bacteria.

When I removed one read, the result seemed normal.

Here are my command lines:
qiime feature-classifier extract-reads \
  --i-sequences gg_99_otus.qza \
  --p-trunc-len 0 \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads test15.qza \
  --o-classification taxonomy.qza

Here are my test data.
14rep.fa.txt (6.2 KB) 15rep.fa.txt (6.6 KB)

How did the representative reads influence the taxonomic classification?

Hi @mol,

It could be that there are not many reference reads in Greengenes that match these particular Archaea sequences. It could also be that there are legitimate Archaea references in Greengenes, but the extract-reads step excluded too many of them because of mismatches between the reference sequences and the primer sequences, which would in turn affect the classifier. I am not really sure. Perhaps others have better insight.
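One rough way to check the second possibility is to count how many references of each domain are still present after extract-reads. The sketch below is only an illustration: `count_ref_domains` is a hypothetical helper (not part of QIIME 2), and it assumes the trimmed reads and the reference taxonomy have already been exported to a FASTA file and a tab-separated taxonomy file.

```shell
# Assumed to have been exported beforehand, for example with:
#   qiime tools export --input-path ref-seqs.qza --output-path ref-seqs-export
#   qiime tools export --input-path ref-taxonomy.qza --output-path ref-taxonomy-export
# The file names passed in below are whatever those exports produced.

# count_ref_domains TRIMMED.fasta TAXONOMY.tsv   (hypothetical helper)
# Prints "<count><TAB><first taxonomic rank>" for the references kept in the FASTA.
count_ref_domains() {
  awk -F'\t' '
    # First file (FASTA): remember the ID of every sequence that survived trimming.
    FNR == NR {
      if ($0 ~ /^>/) { id = substr($0, 2); sub(/[ \t].*$/, "", id); kept[id] = 1 }
      next
    }
    # Second file (taxonomy): skip the header row if there is one.
    $1 == "Feature ID" { next }
    # Tally the first rank (the domain/kingdom label) of every kept reference.
    ($1 in kept) { split($2, rank, ";"); n[rank[1]]++ }
    END { for (d in n) print n[d] "\t" d }
  ' "$1" "$2" | sort -k2
}
```

Running it once on the trimmed reads and once on the untrimmed reference sequences would show whether the Archaea count drops sharply after extract-reads.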

Meanwhile, I would try a few simple things to sanity-check this issue:

  1. Use the full-length 16S Greengenes classifier to make sure that the extract-reads step did not remove important sequences and thereby affect the classifier. If you obtain good hits for all reads with the full-length version, then the extract-reads step may be the issue.
  2. Try the prototype full-length SILVA 138 classifiers located here, or the older SILVA 132 version here.
  3. If you get reasonable hits in step 2, then move on to trying the extract-reads step on the SILVA reference sets and see if you can optimize your classifications.
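To keep the outputs of these trials apart, the same classify-sklearn command from the original post can be run once per classifier. This is just a sketch: `compare_classifiers` is a hypothetical wrapper, and the classifier file names in the usage line are placeholders for whatever files you actually downloaded or trained.

```shell
# compare_classifiers READS.qza CLASSIFIER.qza [CLASSIFIER.qza ...]   (hypothetical wrapper)
# Runs classify-sklearn once per classifier and writes taxonomy-<classifier name>.qza.
compare_classifiers() {
  reads=$1
  shift
  for clf in "$@"; do
    # Name each output after the classifier file it came from.
    name=$(basename "$clf" .qza)
    qiime feature-classifier classify-sklearn \
      --i-classifier "$clf" \
      --i-reads "$reads" \
      --o-classification "taxonomy-${name}.qza"
  done
}

# Example call (placeholder file names):
#   compare_classifiers test15.qza gg-full-length.qza silva-138-full-length.qza
```

Each result can then be inspected separately, so it is easy to see at which step the Archaea assignments appear or disappear.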

Let us know how this goes. :man_factory_worker:


Hi @mol,
Just a couple things to add to @SoilRotifer’s advice.

The sklearn classifier should not give different classifications each time you run it… unless your sequences are in mixed orientations, which confuses the classifier as described here:

So that explains why removing one sequence causes the results to change, and also probably why a few of these classify as Archaea and the others as Bacteria.

Even though you are using extract-reads, you are probably hitting a few bacterial sequences. classify-sklearn would not give bacterial classifications unless that annotation were present in the (trimmed) reference database that you are using.

So in addition to @SoilRotifer’s advice about starting with the full-length database, I’d advise trying to orient your sequences in the same direction. Unfortunately QIIME 2 can’t do that for you right now. Another option is to use classify-consensus-vsearch, which is able to handle mixed-orientation sequences.
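For the re-orientation step, here is a crude sketch of one way to do it outside QIIME 2. `orient_fasta` is a hypothetical helper: it assumes you have a reference FASTA whose sequences all sit on the strand you want, and it flips a read only when its reverse complement shares more 8-mers with that reference than the read itself does. The k-mer size and the tie rule are arbitrary choices for illustration, so treat the output as something to spot-check rather than trust blindly.

```shell
# orient_fasta REF.fasta READS.fasta   (hypothetical helper, not a QIIME 2 command)
# Prints READS with every record on whichever strand shares more 8-mers with REF.
# Ties keep the original orientation.
orient_fasta() {
  awk -v k=8 '
    # Reverse-complement a DNA string; anything outside ACGT becomes N.
    function revcomp(s,    i, c, out) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        out = out ((c in comp) ? comp[c] : "N")
      }
      return out
    }
    # Number of k-mers in s that also occur in the reference.
    function shared(s,    i, n) {
      n = 0
      for (i = 1; i + k - 1 <= length(s); i++)
        if (substr(s, i, k) in refk) n++
      return n
    }
    # Store the k-mers of the reference record collected so far.
    function add_ref(    i) {
      for (i = 1; i + k - 1 <= length(refseq); i++)
        refk[substr(refseq, i, k)] = 1
      refseq = ""
    }
    # Emit the current read, flipped if its reverse complement matches better.
    function flush_read(    rc) {
      if (hdr == "") return
      rc = revcomp(seq)
      print hdr
      if (shared(rc) > shared(seq)) print rc
      else print seq
    }
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    # First file: collect reference k-mers record by record.
    FNR == NR { if ($0 ~ /^>/) add_ref(); else refseq = refseq toupper($0); next }
    # First line of the reads file: finish the last reference record.
    FNR == 1  { add_ref() }
    /^>/      { flush_read(); hdr = $0; seq = ""; next }
              { seq = seq toupper($0) }
    END       { flush_read() }
  ' "$1" "$2"
}
```

The oriented FASTA could then be imported again and classified as before. Sequences are written on a single line each, so any line wrapping in the input is lost.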


The result with the full-length 16S Greengenes classifier was also wrong.
When I used the SILVA 138 classifier instead of Greengenes, I found many archaea in the result.


Thanks for getting back to us @mol! I’m glad we were able to help. :man_technologist: