The error I received that started me down this little rabbit hole began when I ran qiime rescript evaluate-cross-validate
, and it looked like this:
Plugin error from rescript:
unknown kingdom in query set: ;p__;c__;o__;f__;g__;s__
Debug info has been saved to /tmp/qiime2-q2cli-err-xumwidbw.log
Validation: 37.31s
/home/dorourke/miniconda/envs/rescript/lib/python3.6/site-packages/sklearn/model_selection/_split.py:657: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=3.
% (min_groups, self.n_splits)), Warning)
Naturally, this was something that popped up only after I’d spent weeks building a COI database . A tiny bit of background first about how the final database came to be:
- Reference sequences are pulled from BOLD using an R script.
- These sequences are imported in to QIIME, and then filtered using a series of
qiime rescript
steps including:-
cull-seqs
(removed seqs with > 4 N’s, homopolymer runs > 11) -
filter-seqs-length
(retained seqs between 250 - 1600 bp) -
dereplicate
in super mode, with--p-derep-prefix
-
There were also a series of steps used to create a database from those filtered sequences that was trimmed to the primers I use. The empty taxonomic features with just k__Animalia;p__;c__;...
- are present in both databases however, so I expect the same error message noted at the beginning of this post.
There were 32 instances where this taxonomy string was assigned:
tax=k__Animalia;p__;c__;o__;f__;g__;s__
These FeatureIDs with empty taxonomies aren’t empty sequences either. I ran NCBI blast on a handful of sequences and got hits above 90% coverage and 98% identity - so they’re not garbage (in fact, they’re apparently different fish ) …
The obvious thing to do here, I think, it to simply add an additional filtering step to remove this particular kind of string once dereplication is done - does that sound right?
This got me to thinking that some kind of taxonomic filtering should be part of qiime rescript dereplicate
by default. What value would there be in retaining sequences that contain only domain-rank information? Nevertheless, I can imagine that databases using different marker genes (or different databases within the same marker gene too) might need different filtering steps. In any event, I think it’s worth considering whether this level of information should be allowed, given that it’s going to likely lead to a crash in the downstream family of evaluate
functions in RESCRIPt.
Thanks for your thoughts @SoilRotifer @thermokarst @Nicholas_Bokulich!