Sklearn-classifier bug for archaea 16s amplicons

SoilRotifer · January 30, 2020, 7:17pm

It could be that there are not many reference reads in Greengenes that match these particular Archaea sequences. It could also be that there are legitimate Archaea references in Greengenes, but the extract-reads step excluded too many of them because of mismatches between the representative sequences and the primer sequences, which would in turn affect the classifier. I am not really sure. Perhaps others have better insight.
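To make the primer-mismatch idea concrete, here is a rough Python sketch that counts the fewest mismatches between a degenerate primer and each reference sequence. It is not how extract-reads works internally; the function names, the toy primer, the toy sequences, and the cutoff are all made up for illustration.

```python
# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def best_primer_match(primer, sequence):
    """Return (fewest mismatches, start) over all ungapped primer placements."""
    primer = primer.upper()
    sequence = sequence.upper().replace("U", "T")
    # Sentinel: worse than any real placement; returned if the sequence is too short.
    best = (len(primer) + 1, -1)
    for start in range(len(sequence) - len(primer) + 1):
        window = sequence[start:start + len(primer)]
        mismatches = sum(
            base not in IUPAC.get(code, "") for code, base in zip(primer, window)
        )
        if mismatches < best[0]:
            best = (mismatches, start)
    return best


def flag_poor_matches(primer, references, max_mismatches):
    """List reference IDs whose best placement exceeds the (hypothetical) cutoff."""
    return [
        ref_id
        for ref_id, seq in references.items()
        if best_primer_match(primer, seq)[0] > max_mismatches
    ]


# Toy data, invented for this example only.
toy_primer = "ACGYT"
toy_refs = {"ref_a": "TTACGCTGG", "ref_b": "TTACGAAGG"}

for ref_id, seq in toy_refs.items():
    print(ref_id, best_primer_match(toy_primer, seq))
print("would be flagged:", flag_poor_matches(toy_primer, toy_refs, max_mismatches=1))
```

Running something like this on your actual primers and the Archaea entries in the reference FASTA would show whether those references sit near or beyond whatever mismatch tolerance you used when extracting reads.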

Meanwhile, I would try a few simple things to sanity-check this issue:

1. Use the full-length 16S Greengenes classifier to check whether the extract-reads step removed important sequences and thereby affected the classifier. If you obtain good hits with the full-length classifier, then the extract-reads step may be the issue.
2. Try the prototype full-length SILVA 138 classifiers located here, or the older SILVA 132 version here.
3. If you get reasonable hits in step 2, then move on to trying the extract-reads step on the SILVA reference sets and see if you can optimize your classifications.
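One way to judge "good hits" across these runs is to compare how deep each classifier assigns the same features. Below is a minimal Python sketch, assuming you have already read each taxonomy table into a dict of feature ID to taxonomy string. The function names and the example strings are hypothetical, not output from any real run.

```python
def classified_depth(taxon):
    """Count leading ranks that carry a name, stopping at the first empty rank."""
    depth = 0
    for rank in taxon.split(";"):
        # Drop a rank prefix such as 'k__' or 'D_0__' if one is present.
        name = rank.strip().split("__", 1)[-1]
        if not name or name.lower() == "unassigned":
            break
        depth += 1
    return depth


def compare_depths(tax_a, tax_b):
    """Return (feature_id, depth_a, depth_b) for features present in both tables."""
    return [
        (fid, classified_depth(tax_a[fid]), classified_depth(tax_b[fid]))
        for fid in sorted(tax_a.keys() & tax_b.keys())
    ]


# Made-up labels for two features, one table per classifier.
greengenes_like = {
    "feature1": "k__Archaea; p__Euryarchaeota; c__; o__; f__; g__; s__",
    "feature2": "Unassigned",
}
silva_like = {
    "feature1": "D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria",
    "feature2": "D_0__Archaea",
}

for fid, depth_a, depth_b in compare_depths(greengenes_like, silva_like):
    print(fid, depth_a, depth_b)
```

If the full-length classifier consistently reaches deeper ranks than the extracted-region one for your Archaea features, that points toward the extract-reads step rather than the reference database itself.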

Let us know how this goes.

-Mike