However, it is not removing all of the taxa that it should based on the keywords. I’ve done some digging, and I think the problem is that not all of the levels are named in the taxonomy table: some labels skip levels. For example, one taxon that should be filtered but isn’t is written in the table as
D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_10__Neoptera;D_11__Coleoptera.
It should be filtered out by the Arthropoda keyword, but the label skips that level of the taxonomy. I think this is probably because I’m using a SILVA classifier (since it’s 18S), which has many more levels than Greengenes. Is there something I should be including in my feature-classifier step to make it keep the full taxonomy? Or is there any other fix you can suggest besides individually typing each species that I want removed?
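To make the skipped levels concrete, here is a minimal Python sketch that lists which `D_N__` rank indices are absent from the label quoted above. The helper names `present_levels` and `missing_levels` are made up for illustration and are not part of QIIME 2 or SILVA; the only input is the label from this post.

```python
import re

def present_levels(label):
    """Return the sorted rank indices (the N in 'D_N__') found in a label."""
    return sorted(int(n) for n in re.findall(r"D_(\d+)__", label))

def missing_levels(label):
    """Return rank indices between 0 and the deepest rank that the label skips."""
    levels = present_levels(label)
    if not levels:
        return []
    return [n for n in range(max(levels) + 1) if n not in levels]

# The label quoted in this post.
label = (
    "D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;"
    "D_3__Metazoa (Animalia);D_10__Neoptera;D_11__Coleoptera"
)

print(present_levels(label))    # [0, 1, 2, 3, 10, 11]
print(missing_levels(label))    # [4, 5, 6, 7, 8, 9]
print("Arthropoda" in label)    # False, so a keyword match on Arthropoda cannot hit
```

The label jumps from D_3 straight to D_10, so any keyword that lives in the skipped ranks never appears in the string and a substring-based filter has nothing to match.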
Great, I’ll try that! Just to make sure I do this right, though, is there something specific that I would need to do differently (compared to how the pre-trained classifiers were made) to make sure that it maintains the full taxonomy? Because, to me, it seems like the problem might be in the classification step as opposed to the classifier itself - it is assigning taxonomy at high resolution, it’s just not saving the whole taxa name in the taxonomy table it makes.
The problem comes from the reference database, not from the classifier. In the SILVA 7-level taxonomy file, not all sequences have the same taxonomic levels listed, so those levels are missing from the raw data. The classifier can't report levels that are not in the raw data...
However, using the full database will result in an error with the classify-sklearn method, since it has an uneven number of taxonomic levels. So you will either need to use another classification method, such as classify-consensus-vsearch, or figure out how to fix the 7-level taxonomy!
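One way to see whether a taxonomy file has an uneven number of levels is to count the semicolon-separated levels per row. The sketch below assumes a two-column, tab-separated layout (feature ID, then taxonomy string); that layout, the function name `level_counts`, and the example rows are assumptions for illustration, not taken from the actual SILVA release files.

```python
from collections import Counter

def level_counts(taxonomy_rows):
    """Count how many rows have each taxonomy depth.

    Assumes each row is '<feature id><TAB><level;level;...>'.
    """
    counts = Counter()
    for row in taxonomy_rows:
        row = row.rstrip("\n")
        if not row.strip():
            continue  # skip blank lines
        _feature_id, taxon = row.split("\t", 1)
        depth = sum(1 for level in taxon.split(";") if level.strip())
        counts[depth] += 1
    return counts

# Hypothetical rows; only the first taxonomy string comes from this thread.
rows = [
    "seq1\tD_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;"
    "D_3__Metazoa (Animalia);D_10__Neoptera;D_11__Coleoptera",
    "seq2\tD_0__Bacteria;D_1__Proteobacteria",
    "seq3\tD_0__Bacteria;D_1__Firmicutes",
]

counts = level_counts(rows)
print(dict(counts))
print("uneven" if len(counts) > 1 else "even")
```

Because the function only iterates over lines, an open file handle can be passed in place of the list. More than one distinct depth in the result means the taxonomy is uneven in the sense described above.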
Thanks so much for your help! I was able to train my own classifier with the full SILVA taxonomy instead of the SILVA-7 taxonomy and that solved my problem. I hadn’t realized that the pre-trained classifier used the 7-level SILVA, so that was really my problem.