dereplicating with 'super' stripping all taxonomic information... why?

devonorourke · August 12, 2020, 1:17pm

The error I received that started me down this little rabbit hole began when I ran qiime rescript evaluate-cross-validate, and it looked like this:

Plugin error from rescript:

  unknown kingdom in query set: ;p__;c__;o__;f__;g__;s__

Debug info has been saved to /tmp/qiime2-q2cli-err-xumwidbw.log

Validation: 37.31s
/home/dorourke/miniconda/envs/rescript/lib/python3.6/site-packages/sklearn/model_selection/_split.py:657: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=3.
  % (min_groups, self.n_splits)), Warning)

Naturally, this was something that popped up only after I'd spent weeks building a COI database . A tiny bit of background first about how the final database came to be:

Reference sequences are pulled from BOLD using an R script.
These sequences are imported in to QIIME, and then filtered using a series of qiime rescript steps including:
- cull-seqs (removed seqs with > 4 N's, homopolymer runs > 11)
- filter-seqs-length (retained seqs between 250 - 1600 bp)
- dereplicate in super mode, with --p-derep-prefix

There were also a series of steps used to create a database from those filtered sequences that was trimmed to the primers I use. The empty taxonomic features with just k__Animalia;p__;c__;... - are present in both databases however, so I expect the same error message noted at the beginning of this post.

There were 32 instances where this taxonomy string was assigned:

tax=k__Animalia;p__;c__;o__;f__;g__;s__

These FeatureIDs with empty taxonomies aren't empty sequences either. I ran NCBI blast on a handful of sequences and got hits above 90% coverage and 98% identity - so they're not garbage (in fact, they're apparently different fish ) ...

The obvious thing to do here, I think, it to simply add an additional filtering step to remove this particular kind of string once dereplication is done - does that sound right?

This got me to thinking that some kind of taxonomic filtering should be part of qiime rescript dereplicate by default. What value would there be in retaining sequences that contain only domain-rank information? Nevertheless, I can imagine that databases using different marker genes (or different databases within the same marker gene too) might need different filtering steps. In any event, I think it's worth considering whether this level of information should be allowed, given that it's going to likely lead to a crash in the downstream family of evaluate functions in RESCRIPt.

Thanks for your thoughts @SoilRotifer @thermokarst @Nicholas_Bokulich!

Nicholas_Bokulich · August 12, 2020, 3:17pm

Hi @devonorourke,
This error:

is indicating that there is a taxonomy without any kingdom label. It is not choking on the ;p__;c__;o__;f__;g__;s__ part, it is choking on the lack of kingdom label.

My assumption is that this may be leaking in at this step:

but otherwise you could have discovered a bug in dereplicate. Could you check out how many sequences have this taxonomy before and after dereplicate?
;p__;c__;o__;f__;g__;s__

Thanks!

devonorourke · August 12, 2020, 3:36pm

If this was the case, I think I should find some hits with:

qiime tools export --input-path mytaxa.qza --output-path tmpdir_mytaxa
grep 'k__;p' ./tmpdir_mytaxa/taxonomy.tsv

alas, there are no hits.

However, if I modify the search term slightly:

grep 'p__;c__' ./tmpdir_mytaxa/taxonomy.tsv

I'll get results like this:

tax=k__Animalia;p__;c__;o__;f__;g__;s__

This indicates to me that I'm missing the phylum rank, not domain/kingdom rank, correct?

In any event, there are 32 missing labels among 1.7 million. Happy to try additional search terms if you'd like?

Nicholas_Bokulich · August 12, 2020, 3:40pm

no, because that has an empty kingdom rank handle ("k__")... the error message is indicating that there is nothing there. evaluate-cross-validate is detecting this specific string: ;p__;c__;o__;f__;g__;s__

so you could search for that, or something like:

grep '^;p__'  ./tmpdir_mytaxa/taxonomy.tsv

(note that will only work with some versions of grep... you may need to use grep -P etc to specify that this is a regex pattern, depending on which version of grep you are using)

devonorourke · August 12, 2020, 5:54pm

Interesting - thanks for the clarification.

Unfortunately, nothing is returned for:

grep '^;p__' ./tmpdir_mytaxa/taxonomy.tsv

And even if I try to figure out what might be outside of the three expected Kindom terms with this search:

cut -f 2 taxonomy.tsv | grep -v '^tax=k__Animalia;' | grep -v '^tax=k__Protozoa' | grep -v '^tax=k__Fungi' -m 2

it returns just the header:

Taxon

Nothing else!

So I don't think the label we're looking for exists in the data

Any further ideas?

Nicholas_Bokulich · August 12, 2020, 5:57pm

want to send me your QZAs via email and I will take a look?

devonorourke · August 12, 2020, 6:01pm

Can't we get Greg to use that fancy new NCI money and get us a server to share all these files?

You should be able to get the file here: https://osf.io/cn3p7/download

Nicholas_Bokulich · August 12, 2020, 6:48pm

Thanks @devonorourke,

I am not able to replicate this error — everything is running fine (to be clear, I am just testing the taxonomy stratification part that is failing, not evaluate-cross-validate).

Could you please make sure you have the latest version of RESCRIPt installed? If you re-run, please make sure you are using the same file you sent me, and post the full command and error message (use --verbose mode).

devonorourke · August 13, 2020, 10:25am

The error was generated with this version:

QIIME 2 Plugin 'rescript' version 2020.6.1+2.g24b7605 (from package 'rescript' version 2020.6.1+2.g24b7605)

I upgraded to this:

Installing collected packages: rescript
  Attempting uninstall: rescript
    Found existing installation: rescript 2020.6.1+2.g24b7605
    Uninstalling rescript-2020.6.1+2.g24b7605:
      Successfully uninstalled rescript-2020.6.1+2.g24b7605
Successfully installed rescript-2020.6.1+3.g39f608e

Was going to rerun today, but wanted to ask a question first:
Is there a way to use qiime taxa filter-seqs to remove this kind of potential empty taxonomic string? Because my seqs have different Kingdom labels (Animalia, Fungi, Protozoa), I can't use a single --p-include term. I was reading the help menu, but wasn't clear on how to incorporate multiple terms at once. From the help menu:

Parameters:
  --p-include TEXT        One or more search terms that indicate which taxa
                          should be included in the resulting sequences. If
                          providing more than one term, terms should be
                          delimited by the query-delimiter character. By
                          default, all taxa will be included.       [optional]

If I wanted to include k__Animalia, and k__Fungi, and k__Protozoa but discard all other taxonomic labels, how do I do that?

SoilRotifer · August 13, 2020, 1:10pm

Hi @devonorourke,

This should do what you want:

qiime taxa filter-seqs \
  --i-sequences sequences.qza \
  --i-taxonomy taxonomy.qza \
  --p-include k__Animalia,k__Fungi,k__Protozoa \
  --o-filtered-sequences sequences-only-afp.qza

-Mike

devonorourke · August 13, 2020, 1:12pm

Great - thanks Mike.
So in general, when a QIIME function is accepting a list, just make sure to have comma-delimited terms without any quotations or spaces?
I'll give that a go, though I think the output might be relabeled as sequences-only-afp.qza instead of sequences-no-afp.qza (as I'm keeping all those kindom-specific taxonomy strings)

SoilRotifer · August 13, 2020, 1:18pm

@devonorourke Good catch, I'll fix that in my post. I am so used to excluding things that I default to no.

You can put the entire string in quotes if needed, like 'k__Animalia,k__Fungi,k__Protozoa'. Which can help if you have labels with brackets, etc... But yes, avoid spaces and only use commas.

devonorourke · August 13, 2020, 6:01pm

Minor follow up for @Nicholas_Bokulich - the same input file that failed for evaluate-cross-validate worked for evaluate-fit-classifier. Does that make things more confusing, less confusing, or equally ?

Nicholas_Bokulich · August 13, 2020, 6:12pm

that makes perfect sense — I was going to recommend that actually if you keep running into issues.

cross-validate stratifies the sequences across folds so that all clades (if possible) are represented in all folds... if there is a unique clade at the root level (e.g., kingdom) it cannot be stratified appropriately and hence will fail with the error you received.

fit-classifier does none of this trickery — it just trains the classifier once and tests it on the same exact sequences (simulating the "best case" scenario where all query sequences have one or more exact matches in the reference database). So there is no need for stratified clade representation, etc.

devonorourke · August 13, 2020, 6:15pm

The good news is that none of this matters for looking at Chordates or Arthropods, as that filtering removes any issues arising with evaluate-cross-validate!

Still not sure about the empty Kingdom. I tried filtering with qiime taxa filter-seqs and passed along a parameter that would require k__animalia,k__fungi,k__protozoa, and returned the same number of sequences as I input.

I wasn't using the most up to date version of RESCRIPt, so maybe the version bump will improve upon it. I'll give it another go today.

Thanks!

system · September 14, 2020, 12:15am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.