SILVA taxonomy-classifier clarification

Kara · July 17, 2018, 1:46pm

We had to dig around a little to find an example of how to import and use the rep_set and taxonomy files from the SILVA_132_release classifier in QIIME2. We downloaded the 132 release files and then imported the 99% rep_set and 99% taxonomy files (majority_ or consensus_taxonomy_7_levels) into QIIME2. We were then able to run the command below in QIIME2. There are so many variables to change in this final command, and we wanted to double-check that we understand all of the options (there are a lot of "behind-the-scenes" steps going on!):

The first variable seems to be the rep_set. We get to choose the rep_set % as either 90, 94, 97, or 99 for 16S-only data. It sounds like the standard is 97% or above, and from what we understand, this determines the percent sequence similarity for clustering. For example, it starts with one long sequence as a "seed" in the cluster and compares a second sequence to it. If the second sequence is 97%+ similar to the first, they will be clustered together (they'll be assigned a taxonomy later). If it's less than 97% similar, it will become its own "seed" in a new cluster. This continues with the 3rd, 4th, etc. sequences until all sequences have been clustered based on this rep_set percentage threshold. Is that correct?

Now that the clusters are made, taxonomy can be assigned. We get to choose the taxonomy % as either 90, 94, 97, or 99 as well. Within those options we can then choose our desired consensus_taxonomy_7_levels or majority_taxonomy_7_levels (for us, the choice between consensus and majority didn't impact how many reads came up as unassigned at the end). We understand consensus to mean that all potential taxa strings for a cluster are identical (100% the same), whereas majority means they are 90% similar. We aren't quite sure then, how you could pick for example the 94 taxonomy folder and then choose the consensus_taxonomy option inside. What do these 90, 94, 97, 99 %s mean if it's not related to the taxa string similarities within a cluster? Do we need to choose the same % for the taxonomy file as the rep_set file? Can they be different?

The last variable seems to be the --p-perc-identity 0.98. This seems the easiest to understand. It's the minimum percent similarity we want between OUR unknown sequences and the BLAST sequences. Yes?

Phew---long question......sorry! Hoping for any clarification of our questions above!

Nicholas_Bokulich · July 17, 2018, 3:54pm

You could have looked in the overview tutorial or the classifier training tutorial, which both give examples (the former shows a workflow for importing and getting to BLAST; the latter has actual files for Greengenes in the expected formats, plus import examples).

Sounds like you are conflating the SILVA database construction with actual taxonomy assignment.

The classify-consensus-blast command uses BLAST+ for database searching, followed by LCA taxonomy consensus assignment in QIIME 2. You can read more about how that method works and its parameters in the publication for q2-feature-classifier.
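
To make the two stages concrete, here is a rough Python sketch of the idea: filter one query's BLAST hits by percent identity, then keep the deepest taxonomy prefix that enough of the remaining hits agree on. This is a simplified illustration and not the plugin's actual code. The function name, the hit format, and the example taxonomy strings are made up for the example, and the defaults of 0.98 and 0.51 are just stand-ins for --p-perc-identity and --p-min-consensus.

```python
from collections import Counter

def consensus_taxonomy(hits, perc_identity=0.98, min_consensus=0.51):
    """Return a consensus taxonomy for one query sequence.

    hits: list of (taxonomy_string, identity_fraction) pairs, one per BLAST hit.
    This is an illustrative sketch, not the q2-feature-classifier implementation.
    """
    # Stage 1: discard hits below the identity cutoff.
    kept = [tax.split(";") for tax, ident in hits if ident >= perc_identity]
    if not kept:
        return "Unassigned"

    # Stage 2: walk down the ranks and stop where agreement drops too low.
    consensus = []
    for rank in range(min(len(t) for t in kept)):
        prefixes = Counter(tuple(t[: rank + 1]) for t in kept)
        prefix, count = prefixes.most_common(1)[0]
        if count / len(kept) < min_consensus:
            break
        consensus = list(prefix)
    return ";".join(consensus) if consensus else "Unassigned"

# Hypothetical hits: two pass the 0.98 cutoff but disagree at the third rank.
example_hits = [
    ("D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli", 0.99),
    ("D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia", 0.99),
    ("D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria", 0.90),
]
print(consensus_taxonomy(example_hits))  # D_0__Bacteria;D_1__Firmicutes
```

Notice that none of this involves the 90/94/97/99 numbers: those only determine which reference sequences are in the database that gets searched.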

The alignment-related parameters come directly from blastn — you can read the blastn manual for additional details.

Sounds like you are talking about which SILVA rep_set to use, not QIIME 2 parameters.

Sounds like you've read the readme that comes with the SILVA qiime-compatible release. I'd recommend the 99% OTUs rep set, as this has the greatest specificity. I prefer the majority taxonomy, since some labels can be incorrect and that will cause problems if you are looking for 100% consensus.

Again, you are conflating database building with taxonomy assignment. You are just selecting a set of database OTU clusters (and their matching taxonomies); there is not really a complicated decision to make at this stage involving multiple parameters (see below). There is one decision to make: what level of OTU clustering do I wish to use? A higher percentage means more specific OTUs and more specific taxonomies, but also many more reference sequences (leading to longer runtimes and higher memory requirements). So the decision is easy: always use 99% unless you are unable to do so due to computational limitations.

Yes, you need to choose the matching files. The taxonomy files are just the taxonomy labels that correspond to the different rep sets; they have not been clustered independently. The consensus vs. majority taxonomies are based on the raw taxonomies of the sequences that are clustered together. So the different taxonomy files contain different IDs, and potentially different taxonomy labels for any IDs that are shared between these files (because OTU clusters are tighter at 99% than at 94%, for example, and the looser clusters lead to shallower consensus/majority taxonomies). But most importantly, you need the IDs to match or the taxonomy classification will not work.
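
If you want to confirm that a rep set and a taxonomy file belong together before importing them, a quick ID comparison will do it. Below is a minimal sketch, assuming the rep set is a FASTA file and the taxonomy file is a headerless two-column TSV (ID, then taxonomy string). The function names and the tiny in-memory example files are hypothetical and for illustration only.

```python
import io

def read_fasta_ids(handle):
    """Collect sequence IDs (first word after '>') from a FASTA file handle."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    }

def read_taxonomy_ids(handle):
    """Collect IDs from the first column of a headerless, tab-separated taxonomy file."""
    return {line.rstrip("\n").split("\t")[0] for line in handle if line.strip()}

def check_ids_match(fasta_handle, taxonomy_handle):
    """Return (sequence IDs with no taxonomy, taxonomy IDs with no sequence)."""
    seq_ids = read_fasta_ids(fasta_handle)
    tax_ids = read_taxonomy_ids(taxonomy_handle)
    return seq_ids - tax_ids, tax_ids - seq_ids

# Hypothetical mismatched pair, e.g. a rep set and a taxonomy file from different folders.
rep_set = io.StringIO(">seqA\nACGT\n>seqB\nGGCC\n>seqC\nTTAA\n")
taxonomy = io.StringIO("seqA\tD_0__Bacteria\nseqB\tD_0__Bacteria\nseqD\tD_0__Archaea\n")

missing_taxonomy, missing_sequence = check_ids_match(rep_set, taxonomy)
print("Sequences without taxonomy:", sorted(missing_taxonomy))
print("Taxonomy entries without a sequence:", sorted(missing_sequence))
```

For real files you would pass open file handles instead of the StringIO objects. Two empty sets mean every ID is present in both files.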

I hope that clarifies!

Kara · July 18, 2018, 7:06pm

Yes, that absolutely clarifies! We had no idea that we had to choose the same "99" number for the taxonomy folder as the "99" number for the rep_set. We couldn't understand what the "99" would even mean for taxonomy! This helps a LOT! Thanks very much for deciphering what we were actually trying to ask, and then answering it. We are VERY new to this---so it's hard to even put our questions into the correct phrasing sometimes. Working on it.....
