keyerror with RESCRIPt dereplicate and UNITE database

Hi @SoilRotifer thank you for the tutorial. I am facing a similar issue where the dereplication steps of the original data, and the extracted primer region area worked well, but once I perform the first iteration, I get the following error:

qiime rescript dereplicate \
    --i-sequences seq_filt_ext_derep_cull_i01.qza \
    --i-taxa tax_noSH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep_cull_i01derep.qza \
    --o-dereplicated-taxa tax_noSH_derep_i01derep.qza
Plugin error from rescript:

  'SH1089862.09FU_UDB0271834_reps'

Debug info has been saved to /tmp/qiime2-q2cli-err-0aqs4xl0.log

And here is the debug file:

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_fulllength /tmp/qiime2/ortmannac/data/1e88da8d-3b2d-4b62-aafc-0ed595311a2d/data/dna-sequences.fasta --output /tmp/tmpo935_3m5 --uc /tmp/tmpk3z0k2z1 --xsize --threads 8

WARNING: The derep_fulllength command does not support multithreading.
Only 1 thread used.
vsearch v2.22.1_linux_x86_64, 15.5GB RAM, 12 cores
https://github.com/torognes/vsearch

Dereplicating file /tmp/qiime2/ortmannac/data/1e88da8d-3b2d-4b62-aafc-0ed595311a2d/data/dna-sequences.fasta 100%
65787437 nt in 230904 seqs, min 34, max 2166, avg 285
minseqlength 32: 13 sequences discarded.
Sorting 100%
228535 unique sequences, avg cluster 1.0, median 1, max 23
Writing FASTA output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
Traceback (most recent call last):
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'SH1089862.09FU_UDB0271834_reps'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/q2cli/commands.py", line 520, in __call__
    results = self._execute_action(
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/q2cli/commands.py", line 581, in _execute_action
    results = action(**arguments)
  File "<decorator-gen-740>", line 2, in dereplicate
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
    outputs = self._callable_executor_(
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 566, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/rescript/dereplicate.py", line 66, in dereplicate
    derep_taxa, seqs_out = _dereplicate_taxa(
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/rescript/dereplicate.py", line 131, in _dereplicate_taxa
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/rescript/dereplicate.py", line 131, in <lambda>
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexing.py", line 1312, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexing.py", line 1260, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/generic.py", line 4056, in xs
    loc = index.get_loc(key)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'SH1089862.09FU_UDB0271834_reps'

I installed rescript within a conda-installed qiime environment v2023.9.

Thank you!

Hi @emmlemore, how did you install RESCRIPt into qiime2-amplicon-2023.9? The standard install won't work for this version and you'll have to follow the instructions here:

Hi @emmlemore,

I should have mentioned mentioned that the key error you observe usually arises from a mismatch in processing between the taxonomy and sequence files. That is one of your files is missing information associated with SH1089862.09FU_UDB0271834_reps. You may want to check if this is indeed the case.

Hello @SoilRotifer

I do not remember how I installed RESCRIPt but I have redownloaded it into my qiime2 environment using the new instructions you provided.

Apologies but how would I go about checking for the missing information associated with
SH1089862.09FU_UDB0271834_reps and rectifying the issue? Would this sequence have been filtered out during one of the dereplication or culling steps?

1 Like

Hi @emmlemore,

I think the mismatch of the files occurred prior to the dereplication step. Can you provide the commands you ran prior to this? For example I noticed that your taxonomy file is named tax_noSH_derep.qza. What commands were used to generate this file?

The easiest would be to generate visualization for both the taxonomy file and the sequence file, like so:

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Then you can search for the ID 'SH1089862.09FU_UDB0271834_reps' in both files to determine which of the two it is missing. I am expect that this ID is missing from your taxonomy file.

1 Like

I think the mismatch of the files occurred prior to the dereplication step. Can you provide the commands you ran prior to this? For example I noticed that your taxonomy file is named tax_noSH_derep.qza. What commands were used to generate this file?

I followed the tutorial here first: How to train a UNITE classifier using RESCRIPt

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose &> get_unite_data_verbose.log & disown

###removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy taxonomy.qza \
    --i-sequences sequences.qza \
    --o-filtered-sequences sequences_filtered.qza

###removing the specific accessions as annotated within UNITE
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy tax_noSH.qza \   ###the outputted taxonomy file
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

And then I started the curation using your tutorial starting with the first dereplication step:

qiime rescript dereplicate \
    --i-sequences sequences_filtered.qza \
    --i-taxa tax_noSH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_derep.qza \
    --o-dereplicated-taxa tax_noSH_derep.qza

Also I looked at the 2 files and you are correct in that SH1089862.09FU_UDB0271834_reps is missing from the taxonomy file. I went in manually and deleted the fasta sequences because it matched to unhelpful taxonomy k__Eukaryota_kgd_Incertae_sedis;p__Eukaryota_phy_Incertae_sedis;c__Eukaryota_cls_Incertae_sedis;o__Eukaryota_ord_Incertae_sedis;f__Eukaryota_fam_Incertae_sedis;g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU

And ran it again, only to encounter the same problem with another sequence.

Oh right! That is where tax_noSH_derep.qza came from. Silly me! :man_facepalming:

I am wondering if something went wrong with the qiime rescript edit-taxonomy command as the new taxonomy error is showing a string with :

k__Eukaryota_kgd_Incertae_sedis;p__Eukaryota_phy_Incertae_sedis;c__Eukaryota_cls_Incertae_sedis;o__Eukaryota_ord_Incertae_sedis;f__Eukaryota_fam_Incertae_sedis;g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU

Note that ;sh__SH1089862.09FU should have been removed with the edit-taxonmy command. Perhaps re-run this?

1 Like

Hi @emmlemore,

I should also mention you can also run qiime rescript filter-taxa ... to make sure the sequence and taxonomy IDs match.

1 Like

Hi @emmlemore, we are trying to replicate the issue, can you provide us with the primer sequences are you using to extract your amplicon region? I assume you are combining the UNITE tutorial with this extract-seq-segments tutorial?

@emmlemore, I ran the following commands and everything seemed to work just fine. I even used the --p-singletons option while fetching the data.

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir uniteDB \
    --verbose 

qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy uniteDB/taxonomy.qza \
    --i-sequences uniteDB/sequences.qza \
    --o-filtered-sequences uniteDB/sequences-filtered.qza

qiime rescript edit-taxonomy \
    --i-taxonomy uniteDB/taxonomy.qza \
    --o-edited-taxonomy uniteDB/taxonomy-no-SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

qiime rescript dereplicate   \
     --i-sequences uniteDB/sequences-filtered.qza   \
     --i-taxa uniteDB/taxonomy-no-SH.qza   \
     --p-mode 'uniq'   \
     --p-threads 8   \
     --o-dereplicated-sequences  uniteDB/sequences-filtered-derep.qza  \
     --o-dereplicated-taxa  uniteDB/taxonomy-no-SH-derep.qza

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads uniteDB/sequences-filtered-derep.qza \
    --i-reference-taxonomy uniteDB/taxonomy-no-SH-derep.qza \
    --o-classifier uniteDB/classifier.qza

The only thing I can think of, is that there is a mix-up between the dereplicated files generated through extraction of the amplicon region vs the dereplicated files of the original full-length sequence data? Perhaps double-check the taxonomy and sequence files, and make sure you are not overwriting one set of files with another. All I can say is we have not been able to replicate the issue. :man_shrugging:

Again, if you can send us the primer sequences you are using to extract the amplicon region, we can investigate more thoroughly. :slight_smile:

2 Likes

Hi @SoilRotifer,

I ended up not using the qiime taxa filter-seqs and qiime rescript evaluate-taxonomy steps in the first tutorial after downloading the data and directly went to your tutorial (i.e., download > dereplicate) and did not encounter the issue again.

Nevertheless, here is the full code and primer sequence that I used to generate that error:

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose &> get_unite_data_verbose.log & disown

###removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy taxonomy.qza \
    --i-sequences sequences.qza \
    --o-filtered-sequences sequences_filtered.qza

###removing the specific accessions as annotated within UNITE
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy taxonomy_no_SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

###dereplicate filtered sequences
qiime rescript dereplicate \
    --i-sequences sequences_filt.qza \
    --i-taxa taxonomy_noSH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filt_derep.qza \
    --o-dereplicated-taxa taxonomy_noSH_derep.qza

###generate an initial reference pool by extracting the region of interest
qiime feature-classifier extract-reads \
   --i-sequences sequences_filt.qza \
   --p-f-primer GTGAATCATCGAATCTTTGAA \   ###ITS86(F)
   --p-r-primer TCCTCCGCTTATTGATATGC \   ###ITS4(R)
   --p-n-jobs 8 \
   --p-min-length 100 \
   --o-reads sequences_filt_ext.qza \
   --verbose &> extract_primer_sequence_verbose.log

###dereplicate sequences from our target extracted amplicon region
qiime rescript dereplicate \
    --i-sequences sequences_filt_ext.qza \
    --i-taxa taxonomy_noSH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep.qza \
    --o-dereplicated-taxa taxonomy_noSH_derep.qza

###remove sequences with ambiguous IUPAC nts
qiime rescript cull-seqs \
    --i-sequences seq_filt_ext_derep.qza  \
    --p-n-jobs 8 \
    --p-num-degenerates 2 \ 
    --p-homopolymer-length 8 \ 
    --o-clean-sequences seq_filt_ext_derep_cull.qza

###visualize
qiime feature-table tabulate-seqs \
  --i-data seq_filt_ext_derep_cull.qza \
  --o-visualization seq_filt_ext_derep_cull.qzv

###query our reference sequences against our primer extracted region
qiime rescript extract-seq-segments \
    --i-input-sequences sequences_filt.qza \
    --i-reference-segment-sequences seq_filt_ext_derep_cull.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 100 \
    --p-threads 8 \
    --o-extracted-sequence-segments seq_filt_ext_derep_cull_i01.qza \
    --o-unmatched-sequences seq_filt_ext_derep_cull_i01unmatch.qza \
    --verbose

###filter again because with each iteration we'll likely extract segments that are poor quality or contain ambiguous nts
qiime rescript dereplicate \
    --i-sequences seq_filt_ext_derep_cull_i01.qza \
    --i-taxa taxonomy_noSH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep_cull_i01derep.qza \
    --o-dereplicated-taxa tax_noSH_derep_i01derep.qza

Plugin error from rescript:

  'SH1089862.09FU_UDB0271834_reps'

Debug info has been saved to /tmp/qiime2-q2cli-err-4ynn4a61.log

Thank you @emmlemore,

I'll run these commands on my end and let you know if I can replicate the error.

Hi @emmlemore,

Note there were many inconsistant file names i.e. taxonomy_noSH_derep.qza and taxonomy_no_SH_derep.qza, the use of ..filt.. and ..filtered... I tried my best to edit them for clarity and consistency of this post.

I think I figured out the issue. My initial suspicion was correct, that there was a mismatch or overwriting of files. I found that the dereplicate command you ran after the extract-reads step you accidentally named your output file taxonomy_noSH_derep.qza :

qiime rescript dereplicate \
    --i-sequences sequences_filt_ext.qza \
    --i-taxa taxonomy_no_SH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep.qza

This is a problem as this overwrote taxonomy_no_SH_derep.qza file from the previous initial dereplicate command ran on the full length sequences (i.e. after the filter-seqs and edit-taxonomy commands):

qiime rescript dereplicate \
    --i-sequences sequences_filt.qza \
    --i-taxa taxonomy_no_SH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filt_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep.qza

This is why it it could not find 'SH1089862.09FU_UDB0271834_reps', as it was not present in the dereplicated taxonomy file post segment extraction, but is present in the initial derep file. This is why we reuse this file in the tutorial a few times.

I've corrected your pipeline, along with file name edits and pasted below. Everything works as intended. Give it a try and let me know if this works for you.

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose 

cd unite_rescript

# removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy taxonomy.qza \
    --i-sequences sequences.qza \
    --o-filtered-sequences sequences_filtered.qza

# removing the specific accessions as annotated within UNITE
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy taxonomy_no_SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

# dereplicate filtered sequences
qiime rescript dereplicate \
    --i-sequences sequences_filtered.qza \
    --i-taxa taxonomy_no_SH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep.qza

# generate an initial reference pool by extracting the region of interest
# ITS86(F) - ITS4(R)
qiime feature-classifier extract-reads \
   --i-sequences sequences_filtered.qza \
   --p-f-primer GTGAATCATCGAATCTTTGAA \
   --p-r-primer TCCTCCGCTTATTGATATGC \
   --p-n-jobs 8 \
   --p-min-length 100 \
   --o-reads sequences_filtered_ext.qza \
   --verbose

# dereplicate sequences from our target extracted amplicon region
# Note the differeint taxonomy input and output files.
qiime rescript dereplicate \
    --i-sequences sequences_filtered_ext.qza \
    --i-taxa taxonomy_no_SH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_ext_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_ext_derep.qza

# remove sequences with ambiguous IUPAC nts
qiime rescript cull-seqs \
    --i-sequences sequences_filtered_ext_derep.qza \
    --p-n-jobs 8 \
    --p-num-degenerates 2 \
    --p-homopolymer-length 8 \
    --o-clean-sequences sequences_filtered_ext_derep_cull.qza

# visualize
qiime feature-table tabulate-seqs \
  --i-data sequences_filtered_ext_derep_cull.qza \
  --o-visualization sequences_filtered_ext_derep_cull.qzv

# query our reference sequences against our primer extracted region
qiime rescript extract-seq-segments \
    --i-input-sequences sequences_filtered.qza \
    --i-reference-segment-sequences sequences_filtered_ext_derep_cull.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 100 \
    --p-threads 8 \
    --o-extracted-sequence-segments sequences_filtered_ext_derep_cull_i01.qza \
    --o-unmatched-sequences sequences_filtered_ext_derep_cull_i01unmatch.qza \
    --verbose

# filter again because with each iteration we will likely extract segments that are poor quality or contain ambiguous nts
qiime rescript dereplicate \
    --i-sequences sequences_filtered_ext_derep_cull_i01.qza \
    --i-taxa taxonomy_no_SH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_ext_derep_cull_i01derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep_i01derep.qza

-Cheers! :slight_smile:

3 Likes

Ah sorry about the file name mix up! Thanks so much for your help!!

2 Likes

Not a problem at all. If I had a dollar for every time I did the same thing... I'd be rich! :stuck_out_tongue:

2 Likes