It seemingly worked great; however, one of my colleagues, who is interested in copepods, was using the database I downloaded via RESCRIPt and was dismayed by how few copepod sequences it contained. This was curious (and a bit concerning), because browsing the SILVA Browser online shows there are many copepod sequences. After she brought this to my attention, I downloaded the SILVA 138.1 SSU Ref taxonomy file directly from the SILVA archive for a little comparison.
First, I exported the RESCRIPt-pulled version of Silva.
Sooo..... despite having the same total number of sequences, a whole host of copepod sequences are either missing, or their taxonomies aren't parsed correctly and thus falsely appear to be missing.
Has anyone else experienced this and come up with a solution?
There are a couple of reasons for this apparent disparity:
The command you ran will, by default, only write out a 7-rank taxonomy: domain, phylum, class, order, family, genus, and species. Although rank propagation is on by default, the ranks written out at the end of the pipeline may not include the intervening ranks that carry the label Copepoda. That is what is happening in this case. See below.
You can specify which ranks you'd like to download by using the --p-ranks flag. That is, we can list all SILVA available ranks like so: --p-ranks domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus.
Somewhat related, if you were to use the pre-generated files from the Data resources page there might be more inconsistencies due to the SILVA data used (NR99) and the QA/QC approach we took. For the sake of completeness for future readers the general example pipeline is here.
To sanity check that the parsing is working as expected, I ran the commands below. Note: I added --p-no-download-sequences to save time by not downloading the sequences; the resulting sequence file will be empty.
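A sketch of what such a sanity check might look like (the output filenames here are placeholders, not the exact names used in the original commands):

```shell
# Download only the SILVA 138.1 SSURef taxonomy, with every available
# rank requested via --p-ranks. --p-no-download-sequences skips the
# large sequence download, so the sequence artifact will be empty.
qiime rescript get-silva-data \
  --p-version 138.1 \
  --p-target SSURef \
  --p-include-species-labels \
  --p-ranks domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus \
  --p-no-download-sequences \
  --o-silva-sequences silva-138-1-empty-seqs.qza \
  --o-silva-taxonomy silva-138-1-all-ranks-tax.qza
```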
As you can see, all the Copepoda data is indeed captured and parsed by RESCRIPt, as in your grep example (1821 sequences). You just have to specify the ranks you want to write out to file. Different reference databases go by various rules and list groups like Copepoda under different ranks. SILVA lists Copepoda under the sub-class (cs__), infra-class (ci__), and super-order (so__), e.g.:
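As an illustration of what those rank prefixes look like in an exported taxonomy file, here is a made-up mock record (the lineage and feature ID below are hypothetical, not a real SILVA entry) and a grep like the one used above:

```shell
# Mock taxonomy.tsv with one hypothetical lineage using SILVA-style
# rank prefixes (cs__ = subclass, ci__ = infraclass, so__ = superorder).
cat > taxonomy.tsv <<'EOF'
Feature ID	Taxon
EXAMPLE01	d__Eukaryota; p__Arthropoda; cs__Copepoda; o__Calanoida
EOF

# Count records whose subclass is Copepoda:
grep -c "cs__Copepoda" taxonomy.tsv
```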
As you can see from the extracted taxonomic examples above... the information was there all along, just not available within the standard 7-rank taxonomy output. So, just set the ranks you want.
Note: the more rank information you keep the longer it'll take to make your classifier, and it'll be larger too.
Hi @SoilRotifer - Thank you for your quick response and helpful explanation. I had a feeling it had something to do with RESCRIPt pulling the standard ranks useful for prokaryotes by default which proves problematic with the more complex eukaryotic taxonomies.
So, I should be able to use the commands you shared to grab the full taxonomic string for each reference sequence and use that file instead and that'll fix the issue for my eukaryote-loving colleagues.
Thanks for this. I followed that pipeline exactly (so helpful! thank you)... but pretty much used the defaults, thus yielding fewer ranks than I apparently needed. On the Data resources page, is the SILVA database 138 or 138.1? Regardless, for this project my colleague wanted the full Ref database and not the NR99 one, so I couldn't use the pre-generated files this time around.
According to the provenance for the classifiers (just drag and drop the QZA file onto https://view.qiime2.org/, and click "Provenance") they were made with 138. The default for the latest version of RESCRIPt is to download 138.1.
Hi @SoilRotifer - I decided to rerun the RESCRIPt pipeline to grab more ranks and to track down where some of the sequences we are interested in are falling out (I wasn't sure if I could just use the sequences I grabbed before with the new taxonomy file... so, just in case, I'm starting fresh). But I'm stuck at step 1 (!!) because it seems the sequences have imported as FeatureData[RNASequence] and not FeatureData[Sequence].
For reference, this is what I ran:
qiime rescript get-silva-data \
--p-include-species-labels \
--p-version 138.1 \
--p-target SSURef \
--p-ranks domain superkingdom kingdom superphylum phylum subphylum class subclass order family subfamily genus \
--o-silva-sequences silva-138-1-ref-moreranks-seqs.qza \
--o-silva-taxonomy silva-138-1-ref-moreranks-tax.qza
Have you ever experienced this before? I haven't when I've run it in the past, but this is my first time doing it with QIIME 2 v2021.8.
The pipeline rescript get-silva-data will download and import the sequence data as FeatureData[RNASequence], as this is how the data is stored at SILVA. Many users will simply run rescript cull-seqs after import which can take in either FeatureData[RNASequence] or FeatureData[Sequence], but output only FeatureData[Sequence]. But if you'd like to skip that step, or process the data in a different way, then you'll have to run rescript reverse-transcribe.
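For reference, a minimal sketch of that conversion step (the filenames are placeholders based on the outputs of your earlier command):

```shell
# Convert the imported FeatureData[RNASequence] artifact to
# FeatureData[Sequence] (RNA -> DNA).
qiime rescript reverse-transcribe \
  --i-rna-sequences silva-138-1-ref-moreranks-seqs.qza \
  --o-dna-sequences silva-138-1-ref-moreranks-seqs-dna.qza
```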
I realized I had not included this step in the "Getting SILVA data: Hard Mode" or "Easy Mode" sections of the tutorial. So, I've updated those parts to include this step.
Thank you for asking these questions! It helps us keep our documentation up-to-date and clear!
Interestingly, the cull-seqs step is where I seemed to be losing a lot of eukaryotic sequences (at least for zooplankton), and it wasn't necessarily a naming / taxonomic rank issue as I had thought (although it is now nice to have some more ranks in there for the eukaryotes).
Several of the copepod species I looked at have 8 bp homopolymers in their 18S sequences in the database, so they were deleted at this step. I bumped --p-homopolymer-length up to 9 and was able to retain at least one representative per species of interest through the pipeline. Just posting this here in case anyone happens to find themselves in a similar situation.
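In case it helps anyone replicate this, a sketch of the adjusted filtering step (input/output names are placeholders):

```shell
# With --p-homopolymer-length 9, sequences containing homopolymers of
# 9 bp or longer are culled, so sequences with 8 bp homopolymers
# (like the copepod 18S records) are now retained.
qiime rescript cull-seqs \
  --i-sequences silva-138-1-ref-moreranks-seqs-dna.qza \
  --p-homopolymer-length 9 \
  --o-clean-sequences silva-138-1-ref-moreranks-seqs-cleaned.qza
```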
Thanks for that info @ctekellogg! It is for this reason that we created RESCRIPt! Everyone has slightly different requirements for their work, and we feel it is better for researchers to be able to build a database that best suits their needs.
Keep sharing! In fact, when you feel that you have worked out a good RESCRIPt workflow for your project, feel free to post that entire workflow in the forum along with your thoughts.