Hi @mol,
It could be that not many reference reads in Greengenes match these particular Archaea sequences. It could also be that there are legitimate Archaea references in Greengenes, but the extract-reads step ended up excluding too many of them because of mismatches between the reference sequences and the primer sequences, which would in turn affect the classifier. I am not really sure. Perhaps others have better insight.
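To make the primer-mismatch idea a bit more concrete, here is a minimal Python sketch that scores how well a degenerate primer matches a reference sequence. This is only an illustration and not how extract-reads works internally: it checks one strand, allows no gaps, and the function name `best_primer_identity` is my own. You would paste in your own primer and a few Archaea reference sequences.

```python
# IUPAC degenerate base codes mapped to the nucleotides each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def best_primer_identity(primer, sequence):
    """Best ungapped identity (0.0 to 1.0) of a primer against any window of a sequence.

    Hypothetical helper for illustration only; forward strand, no gaps.
    """
    primer = primer.upper()
    sequence = sequence.upper()
    n = len(primer)
    best = 0.0
    # Slide the primer along the sequence and keep the best-scoring window.
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        # A position matches if the base is one the primer code allows.
        matches = sum(base in IUPAC.get(code, "") for code, base in zip(primer, window))
        best = max(best, matches / n)
    return best


if __name__ == "__main__":
    primer = "GTGYCAGCMGCCGCGGTAA"  # replace with your own primer
    # Toy sequences made up for this example, not real references.
    toy_refs = {
        "exact_site": "AAAAGTGCCAGCAGCCGCGGTAATTTT",
        "two_mismatches": "AAAAGTGCCAGCAGCCGCGGTCCTTTT",
    }
    for name, seq in toy_refs.items():
        print(name, round(best_primer_identity(primer, seq), 3))
```

If many of your expected Archaea references score low against one of the primers, that would be consistent with extract-reads dropping them.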
Meanwhile, I would try a few simple things to sanity-check this issue:
- Use the full-length 16S Greengenes classifier to check that the extract-reads step did not remove important sequences, which would affect the classifier. If you get good hits with the full-length version, then the extract-reads step may be the issue.
- Try the prototype full-length SILVA 138 classifiers located here, or the older SILVA 132 version here.
- If you get reasonable hits in step 2, then move on to trying the extract-reads step on the SILVA reference sets and see if you can optimize your classifications.
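To compare the runs above, it can help to tally how many features each classifier assigns to each domain. Here is a rough Python sketch, assuming you have pulled the taxonomy strings out of each classification result into a list. The helper name `domain_counts` and the example strings are made up for illustration, and the prefix handling assumes the usual `k__`, `d__`, or `D_0__` style rank labels.

```python
from collections import Counter


def domain_counts(taxon_strings):
    """Count taxonomy strings by their top-level (domain/kingdom) label.

    Hypothetical helper for illustration; assumes ranks are separated by ';'
    and that a rank prefix, if present, ends in a double underscore.
    """
    counts = Counter()
    for taxon in taxon_strings:
        # Take the first rank, e.g. "k__Archaea" from "k__Archaea; p__...".
        top = taxon.split(";")[0].strip()
        # Drop the rank prefix, if any; an empty label counts as Unassigned.
        domain = top.split("__")[-1] or "Unassigned"
        counts[domain] += 1
    return counts


if __name__ == "__main__":
    # Made-up taxonomy strings standing in for two classifier outputs.
    trimmed_classifier = ["k__Bacteria; p__Firmicutes", "Unassigned", "Unassigned"]
    full_length_classifier = [
        "k__Bacteria; p__Firmicutes",
        "k__Archaea; p__Euryarchaeota",
        "k__Archaea; p__Crenarchaeota",
    ]
    print("trimmed:", dict(domain_counts(trimmed_classifier)))
    print("full-length:", dict(domain_counts(full_length_classifier)))
```

A jump in Archaea counts when switching from the trimmed to the full-length classifier would point toward the extract-reads step.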
Let us know how this goes.
-Mike