I think I'm a little confused regarding taxonomic classifiers and their use for analysis. I have finished the "Moving Pictures" tutorial, and our lab sequences the same region, so the pre-trained classifier from the tutorial fits our work pretty well (I understand that this isn't ideal).
As I understand it, the equivalent step in QIIME 1.9.1 is open/closed-reference OTU picking against Greengenes. However, since DADA2 handles the clustering step separately from open/closed-reference OTU picking, taxonomic assignment is independent of clustering, so the taxonomy is assigned after the clusters are created.
Questions:
If I wanted to re-train, would the classifier be trained on a per-sequencing-run basis? On a per-project basis (multiple sequencing runs)? Or per project (only sequences from the project samples from multiple sequencing runs)?
If I wanted to re-train, what improvement in taxonomic classification should I expect compared to the pre-trained classifiers? 5%-10%? 50%?
Feature classifiers all differ, and there are open questions as to which is best. Would evaluating every classifier for the experiment be important, or is it more important to pick a method that works and stick with it?
Once I train a feature classifier, could I reference it for other projects? Let's say I train a feature classifier for mouse data sets; could I use that same feature classifier for human microbiome data?
If I use the pre-trained classifiers, do I lose a little resolution (e.g., some sequences won't be annotated correctly because I haven't "trained" the classifiers to my primers)?
Thank you, and sorry if these questions were asked elsewhere. I reviewed the white paper briefly.
That's correct: clustering comes first, then taxonomy assignment is a second, independent step. This was also true in QIIME 1 (pick_otus.py then assign_taxonomy.py).
You need to retrain the sklearn-classifier on a per-database basis. So you can train it once on Greengenes May 15 2013 and use it on all samples, runs, experiments, everything. But if you switch to the Silva database v123, you would have to retrain.
Ideally, using the same settings on the same database would produce identical results. You might see big gains by using a better (larger, more specific) database.
Which classifier is best? You found the paper on how the QIIME devs think about taxonomy optimization. Personally I'm not too worried about it because 1) function is more important than taxonomy and 2) it's hard to infer taxonomy from 250 bp of DNA.
I wouldn't cite a classifier... I would cite the database it was trained on. If you built a new database, you could absolutely cite that. Databases get lots of citations.
The pre-trained classifiers should work OK, but you can validate this by looking at the taxonomic composition of your positive controls.
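As a rough illustration of that positive-control check, here is a minimal Python sketch that compares the genus-level profile a classifier reports for a control against the composition you expected. The genus names, proportions, and read counts are made-up placeholders, and Bray-Curtis dissimilarity is just one reasonable distance to use here, not something QIIME 2 prescribes for this purpose.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads in sample")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two relative-abundance profiles.

    0 means identical composition, 1 means no shared taxa.
    """
    taxa = set(expected) | set(observed)
    diff = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    total = sum(expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa)
    return diff / total


# Hypothetical positive control: four genera mixed in equal proportions.
expected = {
    "Bacillus": 0.25,
    "Escherichia": 0.25,
    "Lactobacillus": 0.25,
    "Staphylococcus": 0.25,
}

# Hypothetical genus-level read counts taken from the classifier's output.
observed_counts = {
    "Bacillus": 240,
    "Escherichia": 260,
    "Lactobacillus": 250,
    "Unassigned": 250,
}
observed = relative_abundance(observed_counts)

# Taxa that should be there but are not, and taxa that should not be there.
missing = sorted(set(expected) - set(observed))
unexpected = sorted(set(observed) - set(expected))
dissimilarity = bray_curtis(expected, observed)

print("missing:", missing)
print("unexpected:", unexpected)
print("Bray-Curtis dissimilarity:", round(dissimilarity, 3))
```

In this toy case one expected genus ends up as "Unassigned", which is the kind of signal that would suggest the classifier or reference database deserves a closer look for your primers.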
Let me know if that helps. Let's see if the QIIME devs have any other suggestions or corrections about using the taxonomy classifiers in QIIME 2.
If you are using the same primers, then this is ideal.
Not exactly. Clustering and taxonomy assignment should be kept separate. Even with closed-ref OTU picking you should still attempt to reassign taxonomy, as closed-ref OTU picking is just aligning to the top hit and does not take any kind of confidence metric into account (e.g., other near-top-hits could have different taxonomic labels!). As @colinbrislawn said:
Define "best". Unless you have a mock community with known composition in your run, you do not know the true taxonomic composition and cannot assess whether one classification is better than another. So you are best off using any one of the classifiers in q2-feature-classifier (see here). For parameters, just use the default settings unless you have some particular goals in mind (in which case see the alternative parameter recommendations in Table 2 of the paper you linked to).
The classifier is trained on a specific set of reference sequences, and I imagine you will use the same reference sequences for both human and mouse samples, unless you have some kind of host-specific reference sequences.
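If you do have a mock community in the run, one simple way to put a number on "better" is to score each classifier's detected taxa against the known members. The sketch below uses precision, recall, and F-measure on taxon presence only; the mock members and both classifier outputs are hypothetical placeholders, and `detection_scores` is an illustrative helper, not a q2-feature-classifier function.

```python
def detection_scores(expected_taxa, observed_taxa):
    """Return (precision, recall, F-measure) for taxon detection.

    precision: fraction of reported taxa that are really in the mock
    recall:    fraction of mock members that were reported
    """
    expected_taxa, observed_taxa = set(expected_taxa), set(observed_taxa)
    true_pos = len(expected_taxa & observed_taxa)
    precision = true_pos / len(observed_taxa) if observed_taxa else 0.0
    recall = true_pos / len(expected_taxa) if expected_taxa else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical mock community members (genus level).
mock = {"Bacillus", "Escherichia", "Lactobacillus", "Staphylococcus"}

# Hypothetical genera reported by two different classifiers on the mock sample.
classifier_a = {"Bacillus", "Escherichia", "Lactobacillus", "Staphylococcus", "Pseudomonas"}
classifier_b = {"Bacillus", "Escherichia"}

scores = {
    "classifier_a": detection_scores(mock, classifier_a),
    "classifier_b": detection_scores(mock, classifier_b),
}

for name, (precision, recall, f_measure) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

Here classifier_a finds every member but adds one false positive, while classifier_b is conservative and misses half the community; which trade-off you prefer depends on your goals, which is why the question "which is best?" needs a definition first.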
As Colin said:
So same database, same classifier.
Do you mean that you want to cite the classifier that you trained for a separate project? Cite the reference database and cite the classification method that you used, but don't cite the other project.
If you used the same primers, then you have nothing to worry about. See the notes here.