RESCRIPt - missing sequences or mis-formatted taxonomies?

SoilRotifer · October 14, 2021, 7:48pm

Very good questions @ctekellogg!

There are a couple of reasons for this apparent disparity:

The command you ran will by default only write-out a 7-rank taxonomy. That is: domain, phylum, class, order, family, genus. Although rank-propagation is on by default, the ranks written out at the end of the pipeline may not include those intervening ranks with the label Copepoda. That is what is happening in this case. See below.
- You can specify which ranks you'd like to download by using the --p-ranks flag. That is, we can list all SILVA available ranks like so: --p-ranks domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus.
Somewhat related, if you were to use the pre-generated files from the Data resources page there might be more inconsistencies due to the SILVA data used (NR99) and the QA/QC approach we took. For the sake of completeness for future readers the general example pipeline is here.

To sanity check that the parsing is working as expected I ran the commands below. Note: added --p-no-download-sequences to save time by not downloading the sequences. The resulting seq file will be empty.

qiime rescript get-silva-data \
    --o-silva-sequences all-rank-full-silva-138-1-seqs.qza \
    --o-silva-taxonomy all-rank-full-silva-138-1-tax.qza \
    --p-ranks domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus \
    --p-include-species-labels 
    --p-target SSURef \
    --p-no-download-sequences 

qiime tools export \
    --input-path all-rank-full-silva-138-1-tax.qza \
    --output-path all-rank-full-silva-138-1-tax-export

grep -c all-rank-full-silva-138-1-tax-export/taxonomy.tsv -e Copepoda
1821

As you can see, all the Copepoda data is indeed captured and parsed by RESCRIPt as in your grep example (1821 sequences). You just have to specify the ranks you want to write out to file. Different reference databases by go by various rules, and list groups like Copepoda under different ranks. SILVA lists Copepoda under the sub-class (cs__), infra-class (ci__), and super-order (so__), e.g.:

KF130106.1.1774 d__Eukaryota; sk__Holozoa; k__Animalia; ks__Animalia; sp__Ecdysozoa; p__Arthropoda; ps__Crustacea; pi__Crustacea; sc__Crustacea; c__Maxillopoda; cs__Copepoda; ci__Copepoda; so__Copepoda; o__Monstrilloida; os__Monstrilloida; sf__Monstrilloida; f__Monstrilloida; fs__Monstrilloida; g__Monstrilloida; s__uncultured_eukaryote

AF530545.1.1712 d__Eukaryota; sk__Holozoa; k__Animalia; ks__Animalia; sp__Ecdysozoa; p__Arthropoda; ps__Crustacea; pi__Crustacea; sc__Crustacea; c__Maxillopoda; cs__Copepoda; ci__Copepoda; so__Copepoda; o__Siphonostomatoida; os__Siphonostomatoida; sf__Siphonostomatoida; f__Siphonostomatoida; fs__Siphonostomatoida; g__Siphonostomatoida; s__Copepoda_environmental

As you can see from the extracted taxonomic examples above... the information was there all along, but not available within the standard 7-rank taxonomy output. So, just set the ranks you want

Note: the more rank information you keep the longer it'll take to make your classifier, and it'll be larger too.

I hope this helps.