Great to see someone dive into the weeds!
As one of the contributors to this pipeline, let me see if I can help you out.
Correct. However, I'd recommend keeping all of the reference sequences, including the eukaryotes, in the files. Doing so allows you to more robustly identify off-target reads, i.e. non-bacterial and non-archaeal sequences. The different versions of the database, e.g. 16S-only, were often made for practical reasons: to reduce the memory and storage footprint of the reference database.
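To make the off-target idea concrete, here is a minimal sketch (my own illustration, not part of the pipeline) of flagging features whose domain-level assignment is neither bacterial nor archaeal after classifying against the full-domain reference. The feature IDs and taxonomy strings are made up for the example.

```python
# Hypothetical sketch: with eukaryotes kept in the reference, off-target
# features can be spotted by their domain-level assignment.

def flag_off_target(taxonomy, targets=("d__Bacteria", "d__Archaea")):
    """Return feature IDs whose domain-level label is not in `targets`."""
    off_target = []
    for feature_id, lineage in taxonomy.items():
        domain = lineage.split(";")[0].strip()
        if domain not in targets:
            off_target.append(feature_id)
    return off_target

# Illustrative classifications, not real SILVA output:
classified = {
    "asv1": "d__Bacteria; p__Firmicutes",
    "asv2": "d__Eukaryota; p__Ascomycota",  # e.g. host or fungal contamination
    "asv3": "d__Archaea; p__Crenarchaeota",
}
print(flag_off_target(classified))  # ['asv2']
```

With a 16S-only reference, `asv2` would instead have been forced onto its nearest bacterial or archaeal match, or left unassigned, which is harder to interpret.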
Correct.
Looks like you have it all figured out.
Yep!
Yes. The 99% clustering helps remove some extra noise from the reference data and reduces the size of the data set. We currently (see below) prefer to use the SILVA NR99 data set for the reasons outlined here.
We'll come back to this. We have a treat for you at the end.
But, great question!
Again, traditionally, clustering was a way to remove noisy sequence data and reduce the size of the reference set used for taxonomic classification. Back in the day, many researchers did not have access to computing resources with the memory and CPU power to construct classifiers.
Just to reduce the file size. Once we have the taxonomy file, there is no need to keep that redundant information in the sequence headers.
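As a rough sketch of that step (my own simplification, not the pipeline's actual code): once the accession-to-lineage pairs are written out to the taxonomy file, the FASTA headers can be trimmed down to the accession alone. The header format shown is illustrative.

```python
# Split SILVA-style '>accession lineage' headers: lineages go to a taxonomy
# mapping, and the FASTA keeps only the accession, shrinking the file.

def split_header(header):
    """Split a '>accession lineage' header into its two parts."""
    accession, _, lineage = header.lstrip(">").partition(" ")
    return accession, lineage

fasta_lines = [
    ">AY190577.1.1452 Bacteria;Firmicutes;Bacilli",
    "ACGTACGT",
]
taxonomy = {}
slim_fasta = []
for line in fasta_lines:
    if line.startswith(">"):
        acc, lineage = split_header(line)
        taxonomy[acc] = lineage       # kept once, in the taxonomy file
        slim_fasta.append(">" + acc)  # header without the redundant lineage
    else:
        slim_fasta.append(line)
```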
Correct. The original old code referenced in the README is here. A description of the labeling format is available here:
However, we have since taken a different approach to parsing taxonomy. See the end of this post.
Yep.
Here, the "cluster" is the set of sequences that fall into an OTU, which is represented by a single representative sequence. We form the consensus / majority taxonomy by taking into account all of the lineages that fall within that OTU / representative sequence, and collapsing them into a taxonomy that is, hopefully, a good representation of all the sequences in that cluster.
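A rough sketch of the majority-consensus idea (my own simplification, not RESCRIPt's actual implementation): walk the ranks from domain downward, keep a label while a majority of the cluster's lineages agree on it, and truncate at the first rank where they do not. The lineages and the 51% threshold below are illustrative.

```python
# Majority-consensus taxonomy over the lineages in one OTU / cluster.
# Assumes all lineages have the same number of ranks, for simplicity.
from collections import Counter

def majority_taxonomy(lineages, min_fraction=0.51):
    split = [lineage.split(";") for lineage in lineages]
    consensus = []
    for rank_labels in zip(*split):  # iterate rank by rank across lineages
        label, count = Counter(rank_labels).most_common(1)[0]
        if count / len(rank_labels) < min_fraction:
            break  # no majority at this rank; truncate the consensus here
        consensus.append(label)
    return ";".join(consensus)

cluster = [
    "Bacteria;Firmicutes;Bacilli;Lactobacillales",
    "Bacteria;Firmicutes;Bacilli;Bacillales",
    "Bacteria;Firmicutes;Clostridia;Clostridiales",
]
print(majority_taxonomy(cluster))  # Bacteria;Firmicutes;Bacilli
```

Here the cluster agrees unanimously down to phylum, 2 of 3 agree at class, and no order reaches a majority, so the consensus stops at `Bacilli`.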
Not a problem. Not many have dug deep into the process of generating these files. I for one appreciate your interest! Thank you!
Okay... so that treat I was promising... Instead of re-working your way through that old pipeline, you can simply make use of RESCRIPt to make your own SILVA reference database, even for version 132! You can work through the tutorials and curate the reference data the way you'd like. That is, you can simply run this command:
qiime rescript get-silva-data \
--p-version 132 \
--p-target SSURef_NR99 \
--p-include-species-labels \
--output-dir silva-132
and then follow the rest of the SILVA tutorial. I hope you'll find it superior to what we've done in the past, e.g. in how we parse taxonomy. Although we provide a few ready-to-use files and classifiers, you can certainly go ahead and make your own files, curated the way you'd like. The goal of RESCRIPt is to make life a little easier for those of us interested in constructing and curating our own little piece of reference-database heaven.
As a little history: the old SILVA parsing code that I linked above was eventually updated to this, and then found a home in RESCRIPt.
Take it for a spin and let us know how it works out.