SILVA taxonomy-classifier clarification

Nicholas_Bokulich · July 17, 2018, 3:54pm

You could have looked in the overview tutorial or the classifier training tutorial, which both give examples (the former shows a workflow for importing and getting to blast, the latter has actual files for Greengenes in the expected formats and import examples).

Sounds like you are conflating the SILVA database construction with actual taxonomy assignment.

The classify-consensus-blast command uses BLAST+ for database searching, followed by LCA taxonomy consensus assignment in QIIME 2. You can read more about how that method works and its parameters in the publication for q2-feature-classifier.
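To make the consensus step concrete, here is a minimal sketch in plain Python. It is not the q2-feature-classifier implementation, just an illustration of the idea: given the taxonomy labels of the top hits for one query, keep the deepest lineage that a sufficient fraction of hits agree on. The function name `consensus_taxonomy`, the `min_consensus` argument, and the 0.51 value are assumptions made for this example.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51, unassigned="Unassigned"):
    """Return the deepest lineage shared by at least min_consensus of the hits.

    hit_taxonomies: list of semicolon-delimited labels, one per database hit.
    This is an illustrative sketch, not the plugin's actual code.
    """
    if not hit_taxonomies:
        return unassigned
    # Split each label, e.g. "k__Bacteria; p__Firmicutes", into a list of ranks.
    split = [[rank.strip() for rank in label.split(";")] for label in hit_taxonomies]
    total = len(split)
    consensus = []
    depth = 0
    while True:
        # Count whole lineage prefixes, so agreement at one rank only counts
        # among hits that also agree at every rank above it.
        prefixes = Counter(tuple(s[:depth + 1]) for s in split if len(s) > depth)
        if not prefixes:
            break  # no hit has a label this deep
        prefix, count = prefixes.most_common(1)[0]
        if count / total < min_consensus:
            break  # too little agreement: stop at the previous rank
        consensus = list(prefix)
        depth += 1
    return "; ".join(consensus) if consensus else unassigned
```

With three hits where two say `c__Bacilli` and one says `c__Clostridia`, a threshold of 0.51 keeps the class-level label, while a threshold of 1.0 truncates the assignment at phylum. That is the trade-off the consensus parameter controls.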

The alignment-related parameters come directly from blastn — you can read the blastn manual for additional details.

Sounds like you are talking about which SILVA rep_set to use, not QIIME 2 parameters.

Sounds like you've read the readme that comes with the SILVA qiime-compatible release. I'd recommend the 99% OTUs rep set, as this has the greatest specificity. I prefer the majority taxonomy, since some labels can be incorrect and that will cause problems if you are looking for 100% consensus.

Again, you are conflating database building with taxonomy assignment. You are just selecting a set of database OTU clusters (and their matching taxonomies); there is not really a complicated decision to make at this stage involving multiple parameters (see below). There is one decision to make: what level of OTU clustering do I wish to use? A higher percent identity will mean more specific OTUs and more specific taxonomies, but also many more reference sequences (leading to longer runtimes and higher memory requirements). So the decision is easy: always use 99% unless computational limitations prevent you from doing so.

Yes, you need to choose the matching files. The taxonomy files are just the taxonomy labels that correspond to the different rep sets; they have not been clustered independently. The consensus vs. majority taxonomies are based on the raw taxonomies of the sequences that are clustered together. So the different taxonomy files contain different IDs and potentially different taxonomy labels for any IDs that are shared between these files (because OTU clusters will be broader at 94% than at 99%, for example, and lead to shallower consensus/majority taxonomies). But most importantly, you need the IDs to match or the taxonomy classification will not work.
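If you want to confirm that a rep set and a taxonomy file actually belong together before importing them, a quick ID comparison along these lines can help. This is a rough sketch in plain Python that assumes a FASTA file whose IDs are the first token after `>` and a headerless, tab-separated taxonomy file with the ID in the first column; the helper names (`fasta_ids`, `taxonomy_ids`, `compare_ids`) are made up for this example.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (the first token after '>') from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def taxonomy_ids(taxonomy_lines):
    """Collect IDs from the first column of a headerless, tab-separated file."""
    ids = set()
    for line in taxonomy_lines:
        line = line.rstrip("\n")
        if line.strip():
            ids.add(line.split("\t", 1)[0].strip())
    return ids


def compare_ids(fasta_lines, taxonomy_lines):
    """Report IDs present in one file but not in the other."""
    seq_ids = fasta_ids(fasta_lines)
    tax_ids = taxonomy_ids(taxonomy_lines)
    return {
        "missing_taxonomy": sorted(seq_ids - tax_ids),  # sequences with no label
        "missing_sequence": sorted(tax_ids - seq_ids),  # labels with no sequence
    }
```

Both arguments can be any iterable of lines, so open file handles work, for example `compare_ids(open(fasta_path), open(taxonomy_path))` with your own paths substituted. Two empty lists in the result mean the IDs match; anything else suggests the files come from different clustering levels.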

I hope that clarifies!