We had to dig around a little to find an example of how to import and use the rep_set and taxonomy files from the SILVA_132_release classifier in QIIME2. We downloaded the 132 release files and then imported the 99% rep_set and 99% taxonomy files (majority_taxonomy_7_levels or consensus_taxonomy_7_levels) into QIIME2. We were then able to run the command below in QIIME2. There are a lot of variables to change in this final command, and we wanted to double-check that we understand all of the options (there are a lot of "behind-the-scenes" steps going on!):
The first variable seems to be the rep_set. We get to choose the rep_set % as either 90, 94, 97, or 99 for 16S-only data. It sounds like the standard is 97% or above, and from what we understand, this determines the percent sequence similarity used for clustering. For example, clustering starts with one long sequence as a "seed" and compares a second sequence to it. If the second sequence is at least 97% similar to the seed, the two are clustered together (they'll be assigned a taxonomy later). If it's less than 97% similar, it becomes its own "seed" for a new cluster. This continues with the 3rd, 4th, etc. sequences until all sequences have been clustered based on this rep_set percentage threshold. Is that correct?
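To make sure we're picturing the clustering correctly, here is a minimal Python sketch of the seed-based process as we understand it. It is only an illustration of our mental model, not the actual procedure used to build the SILVA rep_set files. The `identity` function is a toy stand-in (the fraction of matching positions) for a real alignment-based percent identity, and all names here are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of matching positions (NOT a real alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.97):
    """Join each sequence to the first seed it matches at >= threshold;
    otherwise it becomes the seed of a new cluster."""
    clusters = {}  # seed sequence -> list of member sequences
    # Longest sequences are considered first, so they become seeds.
    for seq in sorted(seqs, key=len, reverse=True):
        for seed in clusters:
            if identity(seed, seq) >= threshold:
                clusters[seed].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no seed was close enough
    return clusters


# Made-up example sequences, 100 bases each.
seed = "ACGT" * 25
near = "TT" + seed[2:]        # 2 mismatches -> 98% identical to seed
far = "T" * 10 + seed[10:]    # 8 mismatches -> 92% identical to seed

clusters = greedy_cluster([seed, near, far], threshold=0.97)
print(len(clusters))  # number of clusters formed
```

With these made-up sequences, `near` joins the cluster seeded by `seed`, while `far` falls below 97% and starts its own cluster.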
Now that the clusters are made, taxonomy can be assigned. We get to choose the taxonomy % as either 90, 94, 97, or 99 as well. Within those options we can then choose either consensus_taxonomy_7_levels or majority_taxonomy_7_levels (for us, the choice between consensus and majority didn't change how many reads came up as unassigned at the end). We understand consensus to mean that all potential taxa strings for a cluster are identical (100% the same), whereas majority means that 90% of them are the same. We aren't quite sure, then, how you could pick, for example, the 94 taxonomy folder and then choose the consensus_taxonomy option inside. What do these 90, 94, 97, and 99 percentages mean if they aren't related to the taxa string similarities within a cluster? Do we need to choose the same % for the taxonomy file as for the rep_set file, or can they be different?
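Here is a small Python sketch of how we picture consensus versus majority. It assumes each level of the taxa string is checked separately across all members of a cluster, and that a level failing the agreement cutoff gets a placeholder label. The function name, the placeholder text, and the example taxa are our own assumptions for illustration, not something we have confirmed against the SILVA release.

```python
from collections import Counter


def cluster_taxonomy(taxa_strings, min_fraction=1.0, placeholder="Ambiguous_taxa"):
    """Collapse the members' taxa strings into one string, level by level.

    min_fraction=1.0 mimics our idea of "consensus" (all members agree);
    min_fraction=0.9 mimics our idea of "majority" (at least 90% agree).
    """
    split = [t.split(";") for t in taxa_strings]
    depth = min(len(s) for s in split)
    result = []
    for level in range(depth):
        names = [s[level] for s in split]
        name, count = Counter(names).most_common(1)[0]
        if count / len(names) >= min_fraction:
            result.append(name)
        else:
            result.append(placeholder)  # not enough agreement at this level
    return ";".join(result)


# Made-up cluster: 9 members agree down to class, 1 disagrees at class.
members = ["Bacteria;Firmicutes;Bacilli"] * 9 + ["Bacteria;Firmicutes;Clostridia"]

consensus = cluster_taxonomy(members, min_fraction=1.0)
majority = cluster_taxonomy(members, min_fraction=0.9)
print(consensus)
print(majority)
```

In this toy cluster, the consensus rule stops resolving at the class level because one member disagrees, while the majority rule still reports the class because 9 of 10 members agree.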
The last variable seems to be the --p-perc-identity 0.98. This seems the easiest to understand. It's the minimum percent similarity we want between OUR unknown sequences and the BLAST sequences. Yes?
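In other words, we picture the identity cutoff working like the filter below. This is only a sketch of our reading of the option, not QIIME2's actual code. The hit tuples (reference ID, matching positions, alignment length) and the function name are hypothetical.

```python
def filter_hits(hits, perc_identity=0.98):
    """Keep only hits whose identity (matches / alignment length) meets the cutoff."""
    kept = []
    for ref_id, matches, aln_length in hits:
        if matches / aln_length >= perc_identity:
            kept.append(ref_id)
    return kept


# Made-up hits for one of our unknown sequences.
hits = [("ref_A", 99, 100), ("ref_B", 98, 100), ("ref_C", 97, 100)]

kept = filter_hits(hits, perc_identity=0.98)
print(kept)
```

Here `ref_C` would be dropped because it is only 97% identical, which is below the 0.98 cutoff.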
Phew, long question, sorry! We'd appreciate any clarification of the questions above!