qiime rescript dereplicate --p-mode super leads to "Taxonomic label depth is uneven" error with qiime rescript evaluate-fit-classifier

Dear QIIME2 Developers,

Happy New Year!! Thank you for the wonderful tools!

QIIME2 Version & Installation Method:
QIIME2 2021.11 version installed with conda with rescript added in per the installation directions.

I am having an issue running qiime rescript evaluate-fit-classifier after running qiime rescript dereplicate. The commands and error are below, but the problem seems to be that --p-mode 'super' in the dereplicate step removes the last taxonomic rank (species, or genus if run without species) from the taxonomy of some of the sequences. The same workflow runs perfectly with --p-mode 'uniq' instead.

Related forum post:
The following resolved forum post and error seem somewhat related: Rescript merge-taxa non-urgent bug - blank taxon values created when using 'super'

Thank you very much for your time and help.

Sincerely,

David Bradshaw

The qzas used below can be found at the following link (I think I made it available correctly):
https://drive.google.com/drive/folders/1XmCK_e4HzF1ZsyhFCn-zRJQeRshhZbQM?usp=sharing
Scripts run:

Good workflow:
qiime rescript dereplicate \
  --i-sequences silva-138.1-ssu-nr99-18SEuk1319f-18SEukBr-seqs.qza \
  --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
  --p-rank-handles 'silva' \
  --p-mode 'uniq' \
  --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-18SEuk1319f-18SEukBr-derep-uniq.qza \
  --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-18SEuk1319f-18SEukBr-derep-uniq.qza

qiime rescript evaluate-fit-classifier \
  --i-sequences silva-138.1-ssu-nr99-seqs-18SEuk1319f-18SEukBr-derep-uniq.qza \
  --i-taxonomy silva-138.1-ssu-nr99-tax-18SEuk1319f-18SEukBr-derep-uniq.qza \
  --o-classifier silva-138.1-99-18SEuk1319f-18SEukBr-2021.8-classifier.qza \
  --o-observed-taxonomy silva-138-99-18SEuk1319f-18SEukBr--derep-uniq-taxonomy-predicted-taxonomy.qza \
  --o-evaluation silva-138-99-18SEuk1319f-18SEukBr--derep-uniq-taxonomy-fit-classifier-evaluation.qzv \
  --p-reads-per-batch 10000

Error workflow:

qiime rescript dereplicate \
  --i-sequences silva-138.1-ssu-nr99-18SEuk1319f-18SEukBr-seqs.qza \
  --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
  --p-rank-handles 'silva' \
  --p-mode 'super' \
  --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-18SEuk1319f-18SEukBr-derep-super.qza \
  --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-18SEuk1319f-18SEukBr-derep-super.qza

qiime rescript evaluate-fit-classifier \
  --i-sequences silva-138.1-ssu-nr99-seqs-18SEuk1319f-18SEukBr-derep-super.qza \
  --i-taxonomy silva-138.1-ssu-nr99-tax-18SEuk1319f-18SEukBr-derep-super.qza \
  --o-classifier silva-138.1-99-341f-805r-2021.8-classifier.qza \
  --o-observed-taxonomy silva-138-99-341f-805r--derep-super-taxonomy-predicted-taxonomy.qza \
  --o-evaluation silva-138-99-341f-805r--derep-super-taxonomy-fit-classifier-evaluation.qzv \
  --p-reads-per-batch 10000 \
  --verbose

Error message:

Traceback (most recent call last):
  File "/home/microbiology/miniconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/q2cli/commands.py", line 339, in __call__
    results = action(**arguments)
  File "", line 2, in evaluate_fit_classifier
  File "/home/microbiology/miniconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    outputs = self.callable_executor(scope, callable_args,
  File "/home/microbiology/miniconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 485, in callable_executor
    outputs = self._callable(scope.ctx, **view_args)
  File "/home/microbiology/miniconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/rescript/cross_validate.py", line 35, in evaluate_fit_classifier
    taxa, seq_ids = _validate_cross_validate_inputs(taxonomy, sequences)
  File "/home/microbiology/miniconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/rescript/cross_validate.py", line 205, in _validate_cross_validate_inputs
    _validate_even_rank_taxonomy(taxa)
  File "/home/microbiology/miniconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/rescript/cross_validate.py", line 382, in _validate_even_rank_taxonomy
    raise ValueError('Taxonomic label depth is uneven. All taxonomies '
ValueError: Taxonomic label depth is uneven. All taxonomies must have the same number of semicolon-delimited ranks. The following features are too short: AAAA02038450.2584.4394, AB002062.1.1771, AB002076.1.1798, AB002079.1.1770, AB003944.1.2196, etc...


Hi @David_Bradshaw,

The dereplicate action was initially set up to handle taxonomies with only the standard seven ranks (i.e. dpcofgs). That is, if any taxonomy was truncated at a higher level, we'd backfill the missing lower ranks with their corresponding empty prefixes, e.g. f__; g__; s__.

For example, this:
KJ763795.1.1805 d__Eukaryota; k__Alveolata; p__Dinoflagellata; c__Dinophyceae; o__Gymnodiniphycidae

would become this:
KJ763795.1.1805 d__Eukaryota; k__Alveolata; p__Dinoflagellata; c__Dinophyceae; o__Gymnodiniphycidae; f__; g__; s__
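
To make the backfilling idea concrete, here is a minimal Python sketch. It is not RESCRIPt's actual code: the function name `backfill_ranks`, the `STANDARD_PREFIXES` list, and the rule of appending every standard prefix that follows the last one present are all assumptions made for illustration. It does reproduce the example above.

```python
# Illustrative sketch only; NOT RESCRIPt's implementation.
# Assumption: the "standard" ranks are the seven dpcofgs prefixes.
STANDARD_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def backfill_ranks(taxon, prefixes=STANDARD_PREFIXES):
    """Append empty prefixes for every standard rank below the last one present."""
    ranks = [r.strip() for r in taxon.split(";") if r.strip()]
    if not ranks:
        return "; ".join(prefixes)
    last_prefix = ranks[-1][:3]  # e.g. "o__"
    if last_prefix not in prefixes:
        # Non-standard final rank: this sketch does not know what to append.
        return "; ".join(ranks)
    idx = prefixes.index(last_prefix)
    return "; ".join(ranks + prefixes[idx + 1:])


truncated = ("d__Eukaryota; k__Alveolata; p__Dinoflagellata; "
             "c__Dinophyceae; o__Gymnodiniphycidae")
print(backfill_ranks(truncated))
```

Note that the sketch only knows about the seven standard prefixes, so a label that ends on any other rank is returned without backfilling.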

It appears you are leveraging all the available SILVA taxonomy, in which case the backfilling of rank prefixes will not work. We should probably update the LCA functionality so that it'll backfill using any number / combination of taxonomic ranks. :grimacing:

I'd suggest you stick with using the uniq option for now (keeps identical sequences with uniq taxonomic ranks), and let the classifier handle working out the taxonomic assignment. The classifier will, in effect, perform an LCA when it is unable to disambiguate very similar / identical sequences with differing taxonomy.
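
If you want to see which labels are short before running evaluate-fit-classifier, a rough check along these lines may help. This is only a sketch under assumptions: `find_short_labels` is a hypothetical helper, the taxonomy is assumed to already be loaded as a plain dict of feature ID to taxonomy string, and "too short" is taken to mean fewer semicolon-delimited ranks than the deepest label, which may not match RESCRIPt's own validation exactly.

```python
# Illustrative sketch only; NOT RESCRIPt's validation code.
def rank_depth(taxon):
    """Number of semicolon-delimited ranks in a taxonomy string."""
    return len(taxon.split(";"))


def find_short_labels(taxonomy):
    """Return feature IDs whose label has fewer ranks than the deepest label.

    `taxonomy` is assumed to be a dict of {feature_id: taxonomy_string}.
    """
    depths = {fid: rank_depth(taxon) for fid, taxon in taxonomy.items()}
    max_depth = max(depths.values())
    return sorted(fid for fid, depth in depths.items() if depth < max_depth)


# Made-up IDs and labels, for demonstration only.
example = {
    "seqA": "d__X; p__Y; c__Z",
    "seqB": "d__X; p__Y",
    "seqC": "d__X; p__Y; c__",
}
print(find_short_labels(example))
```

The dict could be built from the taxonomy.tsv file that `qiime tools export` writes for a taxonomy artifact.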


I've created an issue here.

Dear Mike Robeson,

Thank you very much for the backfill explanation. Sorry to add more to your (or someone else's) workload concerning updating the LCA functionality.

As you suggest, I will stick with the uniq option and let the classifier handle the rest. Thank you for the solution and good luck in addressing the issue.

Sincerely,

David Bradshaw
