It seemingly worked great; however, one of my colleagues, who is interested in copepods, was using the database I downloaded via RESCRIPt and was dismayed by how few copepod sequences it contained. This was curious (and a bit concerning), because browsing the SILVA Browser online shows there are many copepod sequences. After she brought this to my attention, I downloaded the SILVA 138.1 SSU Ref taxonomy file directly from the SILVA archive for a little comparison.
First, I exported the RESCRIPt-pulled version of Silva.
Sooo..... despite having the same total number of sequences, a whole host of copepod sequences are either missing, or their taxonomies aren't parsed correctly and thus falsely appear to be missing.
Has anyone else experienced this and come up with a solution?
There are a couple of reasons for this apparent disparity:
The command you ran will, by default, only write out a 7-rank taxonomy: domain, phylum, class, order, family, genus, and species. Although rank propagation is on by default, the ranks written out at the end of the pipeline may not include the intervening ranks that carry the label Copepoda. That is what is happening in this case. See below.
You can specify which ranks you'd like to download by using the --p-ranks flag. That is, we can list all SILVA available ranks like so: --p-ranks domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus.
Somewhat related, if you were to use the pre-generated files from the Data resources page there might be more inconsistencies due to the SILVA data used (NR99) and the QA/QC approach we took. For the sake of completeness for future readers the general example pipeline is here.
To sanity check that the parsing is working as expected, I ran the commands below. Note: I added --p-no-download-sequences to save time by not downloading the sequences; the resulting sequence file will be empty.
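A sketch of what such a sanity check might look like (the output filenames here are placeholders, not the exact names used in the original commands):

```shell
# Download only the SILVA 138.1 SSURef taxonomy, with every available
# rank requested via --p-ranks. --p-no-download-sequences skips the
# large sequence download, so the sequence artifact will be empty.
qiime rescript get-silva-data \
  --p-version 138.1 \
  --p-target SSURef \
  --p-include-species-labels \
  --p-ranks domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus \
  --p-no-download-sequences \
  --o-silva-sequences silva-138-1-empty-seqs.qza \
  --o-silva-taxonomy silva-138-1-all-ranks-tax.qza
```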
As you can see, all the Copepoda data is indeed captured and parsed by RESCRIPt, as in your grep example (1821 sequences). You just have to specify the ranks you want to write out to file. Different reference databases go by various rules and list groups like Copepoda under different ranks. SILVA lists Copepoda under the sub-class (cs__), infra-class (ci__), and super-order (so__), e.g.:
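As an illustration of what those rank prefixes look like in an exported taxonomy file, here is a made-up mock record (the lineage and feature ID below are hypothetical, not a real SILVA entry) and a grep like the one used above:

```shell
# Mock taxonomy.tsv with one hypothetical lineage using SILVA-style
# rank prefixes (cs__ = subclass, ci__ = infraclass, so__ = superorder).
cat > taxonomy.tsv <<'EOF'
Feature ID	Taxon
EXAMPLE01	d__Eukaryota; p__Arthropoda; cs__Copepoda; o__Calanoida
EOF

# Count records whose subclass is Copepoda:
grep -c "cs__Copepoda" taxonomy.tsv
```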
As you can see from the extracted taxonomic examples above... the information was there all along, just not available within the standard 7-rank taxonomy output. So, just set the ranks you want.
Note: the more rank information you keep the longer it'll take to make your classifier, and it'll be larger too.
Hi @SoilRotifer - Thank you for your quick response and helpful explanation. I had a feeling it had something to do with RESCRIPt pulling the standard ranks useful for prokaryotes by default which proves problematic with the more complex eukaryotic taxonomies.
So, I should be able to use the commands you shared to grab the full taxonomic string for each reference sequence and use that file instead and that'll fix the issue for my eukaryote-loving colleagues.
Thanks for this. I followed that pipeline exactly (so helpful! thank you)... but pretty much used the defaults, thus yielding fewer ranks than I apparently needed. On the Data resources page, is the SILVA database 138 or 138.1? Regardless, for this project my colleague wanted the full Ref database and not the NR99 one, so I couldn't use the pre-generated files this time around.
According to the provenance for the classifiers (just drag and drop the QZA file onto https://view.qiime2.org/, and click "Provenance") they were made with 138. The default for the latest version of RESCRIPt is to download 138.1.
Hi @SoilRotifer - I decided to rerun the RESCRIPt pipeline to grab more ranks and to track down where some of the sequences we are interested in are falling out (I wasn't sure if I could just use the sequences I grabbed before with the new taxonomy file... so, just in case, I'm starting fresh). But I'm stuck at step 1 (!!) because it seems the sequences have imported as FeatureData[RNASequence] and not FeatureData[Sequence].
For reference, this is what I ran:
qiime rescript get-silva-data \
--p-include-species-labels \
--p-version 138.1 \
--p-target SSURef \
--p-ranks domain superkingdom kingdom superphylum phylum subphylum class subclass order family subfamily genus \
--o-silva-sequences silva-138-1-ref-moreranks-seqs.qza \
--o-silva-taxonomy silva-138-1-ref-moreranks-tax.qza
Have you ever experienced this before? I haven't when I've run it in the past, but this is my first time doing it with QIIME 2 v2021.8.
The pipeline rescript get-silva-data will download and import the sequence data as FeatureData[RNASequence], as this is how the data is stored at SILVA. Many users will simply run rescript cull-seqs after import which can take in either FeatureData[RNASequence] or FeatureData[Sequence], but output only FeatureData[Sequence]. But if you'd like to skip that step, or process the data in a different way, then you'll have to run rescript reverse-transcribe.
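For reference, a minimal sketch of that conversion step (the filenames are placeholders based on the outputs of your earlier command):

```shell
# Convert the imported FeatureData[RNASequence] artifact to
# FeatureData[Sequence] (RNA -> DNA).
qiime rescript reverse-transcribe \
  --i-rna-sequences silva-138-1-ref-moreranks-seqs.qza \
  --o-dna-sequences silva-138-1-ref-moreranks-seqs-dna.qza
```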
I realized I had not included this step in the "Getting SILVA data: Hard Mode" or "Easy Mode" sections of the tutorial. So, I've updated those parts to include this step.
Thank you for asking these questions! It helps us keep our documentation up-to-date and clear!
Interestingly, the cull-seqs step is where I seemed to be losing a lot of eukaryotic sequences (at least for zooplankton), and it wasn't necessarily a naming / taxonomic rank issue as I had thought (although it is now nice to have some more ranks in there for the eukaryotes).
Several of the copepod species I looked at have 8 bp homopolymers in their 18S sequences in the database, so they were deleted at this step. I bumped --p-homopolymer-length up to 9 and was able to retain at least one representative per species of interest through the pipeline. Just posting this here in case anyone happens to find themselves in a similar situation.
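In case it helps anyone replicate this, a sketch of the adjusted filtering step (input/output names are placeholders):

```shell
# With --p-homopolymer-length 9, sequences containing homopolymers of
# 9 bp or longer are culled, so sequences with 8 bp homopolymers
# (like the copepod 18S records) are now retained.
qiime rescript cull-seqs \
  --i-sequences silva-138-1-ref-moreranks-seqs-dna.qza \
  --p-homopolymer-length 9 \
  --o-clean-sequences silva-138-1-ref-moreranks-seqs-cleaned.qza
```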
Thanks for that info @ctekellogg! It is for this reason that we created RESCRIPt! Everyone has slightly different requirements for their work, and we feel it is better for researchers to be able to build a database that best suits their needs.
Keep sharing! In fact, when you feel that you have worked out a good RESCRIPt workflow for your project, feel free to post that entire workflow in the forum along with your thoughts.