I would just like to know if making two separate classifiers from the same database makes a difference to the accuracy of classification.
For instance, I want to remove "non-microbial" reads from my sequences and process only the microbial reads. What I usually do is build a classifier from the sequences I want to remove (e.g., Fungi, Animalia), run classification, and filter out the reads classified as "non-microbe". After that, I build a second classifier from the pool of reference sequences devoid of the "non-microbe" sequences and use it to classify the remaining reads (the result is considered the final 'microbe-only' dataset).
I am not running this on an HPC cluster, so I am trying to make my runs more manageable on my laptop.
The bottom line is: am I compromising the quality of the results?
I think what you are proposing sounds reasonable, though I'd leave at least a handful of microbial sequences in your "non-microbe" classifier, and vice versa for your "microbe" classifier, just in case something is incorrectly classified at the domain level. That is, there should always be several outgroup taxa (domains in this case) in your classifier.
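To make the outgroup idea concrete, here is a minimal sketch in plain Python of how the reference IDs might be split into the two training sets, each topped up with a few sequences from the other side. Everything in it is an assumption for illustration: the function name `split_reference`, the toy taxonomy strings, and the substring matching on lineage labels are hypothetical, and this is not a QIIME 2 or RESCRIPt API.

```python
import random


def split_reference(taxonomy, non_microbe_labels, n_outgroup=5, seed=0):
    """Split reference IDs into two training sets that each keep a few outgroups.

    taxonomy: dict mapping reference sequence ID -> taxonomy string
    non_microbe_labels: substrings marking a lineage as non-microbial (assumed)
    n_outgroup: how many sequences from the opposite group to keep in each set
    """
    rng = random.Random(seed)  # fixed seed so the outgroup choice is reproducible

    non_microbe = sorted(
        sid for sid, tax in taxonomy.items()
        if any(label in tax for label in non_microbe_labels)
    )
    non_microbe_set = set(non_microbe)
    microbe = sorted(sid for sid in taxonomy if sid not in non_microbe_set)

    # A handful of microbes go into the "non-microbe" set, and vice versa.
    outgroup_for_non_microbe = rng.sample(microbe, min(n_outgroup, len(microbe)))
    outgroup_for_microbe = rng.sample(non_microbe, min(n_outgroup, len(non_microbe)))

    return {
        "non_microbe_classifier": sorted(non_microbe + outgroup_for_non_microbe),
        "microbe_classifier": sorted(microbe + outgroup_for_microbe),
    }


# Toy data, made up for this example only.
toy_taxonomy = {
    "ref1": "d__Bacteria; p__Proteobacteria",
    "ref2": "d__Bacteria; p__Firmicutes",
    "ref3": "d__Archaea; p__Euryarchaeota",
    "ref4": "d__Eukaryota; k__Fungi",
    "ref5": "d__Eukaryota; k__Animalia",
}

sets = split_reference(toy_taxonomy, ["Fungi", "Animalia"], n_outgroup=1)
print(sets)
```

The resulting ID lists could then be used to subset the reference sequences and taxonomy before training each classifier. With a real database you would probably want the outgroups to span several lineages rather than being drawn purely at random.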
There are some other avenues you might try to exclude non-microbial reads:
Perhaps run one of the above two procedures... then only add a handful of representative non-microbial taxa to your microbial classifier as the representative outgroups.
How are you making your reference database? Have you tried the RESCRIPt plugin? The tutorials hint at a few ways you can reduce the size of the reference database, e.g., dereplicating the reference sequences or using only the amplicon region.
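As a rough picture of why dereplication shrinks the database, the sketch below keeps one representative per unique (sequence, taxonomy) pair. It is a simplified illustration in plain Python with hypothetical names and toy data, not RESCRIPt's implementation, which handles this more carefully.

```python
def dereplicate(seqs, taxonomy):
    """Keep one reference ID per unique (sequence, taxonomy) pair.

    seqs: dict mapping reference ID -> nucleotide sequence
    taxonomy: dict mapping reference ID -> taxonomy string
    """
    representative = {}
    for sid in sorted(seqs):  # sorted so the kept ID is deterministic
        key = (seqs[sid].upper(), taxonomy[sid])
        representative.setdefault(key, sid)  # first ID seen for a pair wins
    kept = sorted(representative.values())
    return {sid: seqs[sid] for sid in kept}


# Toy data, made up for this example only.
toy_seqs = {"a": "ACGT", "b": "acgt", "c": "ACGT", "d": "GGCC"}
toy_tax = {
    "a": "d__Bacteria; p__Firmicutes",
    "b": "d__Bacteria; p__Firmicutes",    # same sequence and taxonomy as "a"
    "c": "d__Bacteria; p__Bacteroidota",  # same sequence, different taxonomy
    "d": "d__Archaea; p__Euryarchaeota",
}

derep = dereplicate(toy_seqs, toy_tax)
print(sorted(derep))
```

Here "b" is dropped because it duplicates "a", while "c" is kept because its taxonomy differs. Trimming references to the amplicon region first tends to create more such duplicates, so the two steps work well together on a laptop.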
Thanks for this helpful response.
I've been using the RESCRIPt plugin that you made to prepare the classifiers, and I also follow the tutorials you made. They are really helpful and make everything faster.
I'll try your suggestions and see whether there is a significant difference in the classification.
Again, thanks, and more power to open Science!