keyerror with RESCRIPt dereplicate and UNITE database

emmlemore · January 19, 2024, 8:01pm

Hi @SoilRotifer thank you for the tutorial. I am facing a similar issue where the dereplication steps of the original data, and the extracted primer region area worked well, but once I perform the first iteration, I get the following error:

qiime rescript dereplicate \
    --i-sequences seq_filt_ext_derep_cull_i01.qza \
    --i-taxa tax_noSH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep_cull_i01derep.qza \
    --o-dereplicated-taxa tax_noSH_derep_i01derep.qza
Plugin error from rescript:

  'SH1089862.09FU_UDB0271834_reps'

Debug info has been saved to /tmp/qiime2-q2cli-err-0aqs4xl0.log

And here is the debug file:

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_fulllength /tmp/qiime2/ortmannac/data/1e88da8d-3b2d-4b62-aafc-0ed595311a2d/data/dna-sequences.fasta --output /tmp/tmpo935_3m5 --uc /tmp/tmpk3z0k2z1 --xsize --threads 8

WARNING: The derep_fulllength command does not support multithreading.
Only 1 thread used.
vsearch v2.22.1_linux_x86_64, 15.5GB RAM, 12 cores
https://github.com/torognes/vsearch

Dereplicating file /tmp/qiime2/ortmannac/data/1e88da8d-3b2d-4b62-aafc-0ed595311a2d/data/dna-sequences.fasta 100%
65787437 nt in 230904 seqs, min 34, max 2166, avg 285
minseqlength 32: 13 sequences discarded.
Sorting 100%
228535 unique sequences, avg cluster 1.0, median 1, max 23
Writing FASTA output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
Traceback (most recent call last):
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'SH1089862.09FU_UDB0271834_reps'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/q2cli/commands.py", line 520, in __call__
    results = self._execute_action(
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/q2cli/commands.py", line 581, in _execute_action
    results = action(**arguments)
  File "<decorator-gen-740>", line 2, in dereplicate
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
    outputs = self._callable_executor_(
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 566, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/rescript/dereplicate.py", line 66, in dereplicate
    derep_taxa, seqs_out = _dereplicate_taxa(
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/rescript/dereplicate.py", line 131, in _dereplicate_taxa
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/rescript/dereplicate.py", line 131, in <lambda>
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexing.py", line 1312, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexing.py", line 1260, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/generic.py", line 4056, in xs
    loc = index.get_loc(key)
  File "/home/ortmannac/miniconda3/envs/qiime2/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'SH1089862.09FU_UDB0271834_reps'

I installed rescript within a conda-installed qiime environment v2023.9.

Thank you!

SoilRotifer · January 19, 2024, 8:09pm

Hi @emmlemore, how did you install RESCRIPt into qiime2-amplicon-2023.9? The standard install won't work for this version and you'll have to follow the instructions here:

SoilRotifer · January 19, 2024, 9:05pm

Hi @emmlemore,

I should have mentioned mentioned that the key error you observe usually arises from a mismatch in processing between the taxonomy and sequence files. That is one of your files is missing information associated with SH1089862.09FU_UDB0271834_reps. You may want to check if this is indeed the case.

emmlemore · January 23, 2024, 1:33pm

Hello @SoilRotifer

I do not remember how I installed RESCRIPt but I have redownloaded it into my qiime2 environment using the new instructions you provided.

Apologies but how would I go about checking for the missing information associated with
SH1089862.09FU_UDB0271834_reps and rectifying the issue? Would this sequence have been filtered out during one of the dereplication or culling steps?

SoilRotifer · January 23, 2024, 2:26pm

Hi @emmlemore,

I think the mismatch of the files occurred prior to the dereplication step. Can you provide the commands you ran prior to this? For example I noticed that your taxonomy file is named tax_noSH_derep.qza. What commands were used to generate this file?

The easiest would be to generate visualization for both the taxonomy file and the sequence file, like so:

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Then you can search for the ID 'SH1089862.09FU_UDB0271834_reps' in both files to determine which of the two it is missing. I am expect that this ID is missing from your taxonomy file.

emmlemore · January 23, 2024, 2:44pm

I think the mismatch of the files occurred prior to the dereplication step. Can you provide the commands you ran prior to this? For example I noticed that your taxonomy file is named tax_noSH_derep.qza. What commands were used to generate this file?

I followed the tutorial here first: How to train a UNITE classifier using RESCRIPt

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose &> get_unite_data_verbose.log & disown

###removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy taxonomy.qza \
    --i-sequences sequences.qza \
    --o-filtered-sequences sequences_filtered.qza

###removing the specific accessions as annotated within UNITE
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy tax_noSH.qza \   ###the outputted taxonomy file
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

And then I started the curation using your tutorial starting with the first dereplication step:

qiime rescript dereplicate \
    --i-sequences sequences_filtered.qza \
    --i-taxa tax_noSH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_derep.qza \
    --o-dereplicated-taxa tax_noSH_derep.qza

Also I looked at the 2 files and you are correct in that SH1089862.09FU_UDB0271834_reps is missing from the taxonomy file. I went in manually and deleted the fasta sequences because it matched to unhelpful taxonomy k__Eukaryota_kgd_Incertae_sedis;p__Eukaryota_phy_Incertae_sedis;c__Eukaryota_cls_Incertae_sedis;o__Eukaryota_ord_Incertae_sedis;f__Eukaryota_fam_Incertae_sedis;g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU

And ran it again, only to encounter the same problem with another sequence.

SoilRotifer · January 23, 2024, 2:50pm

Oh right! That is where tax_noSH_derep.qza came from. Silly me!

I am wondering if something went wrong with the qiime rescript edit-taxonomy command as the new taxonomy error is showing a string with :

k__Eukaryota_kgd_Incertae_sedis;p__Eukaryota_phy_Incertae_sedis;c__Eukaryota_cls_Incertae_sedis;o__Eukaryota_ord_Incertae_sedis;f__Eukaryota_fam_Incertae_sedis;g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU

Note that ;sh__SH1089862.09FU should have been removed with the edit-taxonmy command. Perhaps re-run this?

SoilRotifer · January 23, 2024, 6:30pm

Hi @emmlemore,

I should also mention you can also run qiime rescript filter-taxa ... to make sure the sequence and taxonomy IDs match.

SoilRotifer · January 23, 2024, 9:34pm

Hi @emmlemore, we are trying to replicate the issue, can you provide us with the primer sequences are you using to extract your amplicon region? I assume you are combining the UNITE tutorial with this extract-seq-segments tutorial?

SoilRotifer · January 23, 2024, 9:55pm

@emmlemore, I ran the following commands and everything seemed to work just fine. I even used the --p-singletons option while fetching the data.

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir uniteDB \
    --verbose 

qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy uniteDB/taxonomy.qza \
    --i-sequences uniteDB/sequences.qza \
    --o-filtered-sequences uniteDB/sequences-filtered.qza

qiime rescript edit-taxonomy \
    --i-taxonomy uniteDB/taxonomy.qza \
    --o-edited-taxonomy uniteDB/taxonomy-no-SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

qiime rescript dereplicate   \
     --i-sequences uniteDB/sequences-filtered.qza   \
     --i-taxa uniteDB/taxonomy-no-SH.qza   \
     --p-mode 'uniq'   \
     --p-threads 8   \
     --o-dereplicated-sequences  uniteDB/sequences-filtered-derep.qza  \
     --o-dereplicated-taxa  uniteDB/taxonomy-no-SH-derep.qza

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads uniteDB/sequences-filtered-derep.qza \
    --i-reference-taxonomy uniteDB/taxonomy-no-SH-derep.qza \
    --o-classifier uniteDB/classifier.qza

The only thing I can think of, is that there is a mix-up between the dereplicated files generated through extraction of the amplicon region vs the dereplicated files of the original full-length sequence data? Perhaps double-check the taxonomy and sequence files, and make sure you are not overwriting one set of files with another. All I can say is we have not been able to replicate the issue.

Again, if you can send us the primer sequences you are using to extract the amplicon region, we can investigate more thoroughly.

emmlemore · January 25, 2024, 12:57pm

Hi @SoilRotifer,

I ended up not using the qiime taxa filter-seqs and qiime rescript evaluate-taxonomy steps in the first tutorial after downloading the data and directly went to your tutorial (i.e., download > dereplicate) and did not encounter the issue again.

Nevertheless, here is the full code and primer sequence that I used to generate that error:

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose &> get_unite_data_verbose.log & disown

###removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy taxonomy.qza \
    --i-sequences sequences.qza \
    --o-filtered-sequences sequences_filtered.qza

###removing the specific accessions as annotated within UNITE
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy taxonomy_no_SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

###dereplicate filtered sequences
qiime rescript dereplicate \
    --i-sequences sequences_filt.qza \
    --i-taxa taxonomy_noSH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filt_derep.qza \
    --o-dereplicated-taxa taxonomy_noSH_derep.qza

###generate an initial reference pool by extracting the region of interest
qiime feature-classifier extract-reads \
   --i-sequences sequences_filt.qza \
   --p-f-primer GTGAATCATCGAATCTTTGAA \   ###ITS86(F)
   --p-r-primer TCCTCCGCTTATTGATATGC \   ###ITS4(R)
   --p-n-jobs 8 \
   --p-min-length 100 \
   --o-reads sequences_filt_ext.qza \
   --verbose &> extract_primer_sequence_verbose.log

###dereplicate sequences from our target extracted amplicon region
qiime rescript dereplicate \
    --i-sequences sequences_filt_ext.qza \
    --i-taxa taxonomy_noSH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep.qza \
    --o-dereplicated-taxa taxonomy_noSH_derep.qza

###remove sequences with ambiguous IUPAC nts
qiime rescript cull-seqs \
    --i-sequences seq_filt_ext_derep.qza  \
    --p-n-jobs 8 \
    --p-num-degenerates 2 \ 
    --p-homopolymer-length 8 \ 
    --o-clean-sequences seq_filt_ext_derep_cull.qza

###visualize
qiime feature-table tabulate-seqs \
  --i-data seq_filt_ext_derep_cull.qza \
  --o-visualization seq_filt_ext_derep_cull.qzv

###query our reference sequences against our primer extracted region
qiime rescript extract-seq-segments \
    --i-input-sequences sequences_filt.qza \
    --i-reference-segment-sequences seq_filt_ext_derep_cull.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 100 \
    --p-threads 8 \
    --o-extracted-sequence-segments seq_filt_ext_derep_cull_i01.qza \
    --o-unmatched-sequences seq_filt_ext_derep_cull_i01unmatch.qza \
    --verbose

###filter again because with each iteration we'll likely extract segments that are poor quality or contain ambiguous nts
qiime rescript dereplicate \
    --i-sequences seq_filt_ext_derep_cull_i01.qza \
    --i-taxa taxonomy_noSH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep_cull_i01derep.qza \
    --o-dereplicated-taxa tax_noSH_derep_i01derep.qza

Plugin error from rescript:

  'SH1089862.09FU_UDB0271834_reps'

Debug info has been saved to /tmp/qiime2-q2cli-err-4ynn4a61.log

SoilRotifer · January 25, 2024, 2:09pm

Thank you @emmlemore,

I'll run these commands on my end and let you know if I can replicate the error.

SoilRotifer · January 25, 2024, 2:48pm

Hi @emmlemore,

Note there were many inconsistant file names i.e. taxonomy_noSH_derep.qza and taxonomy_no_SH_derep.qza, the use of ..filt.. and ..filtered... I tried my best to edit them for clarity and consistency of this post.

I think I figured out the issue. My initial suspicion was correct, that there was a mismatch or overwriting of files. I found that the dereplicate command you ran after the extract-reads step you accidentally named your output file taxonomy_noSH_derep.qza :

qiime rescript dereplicate \
    --i-sequences sequences_filt_ext.qza \
    --i-taxa taxonomy_no_SH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences seq_filt_ext_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep.qza

This is a problem as this overwrote taxonomy_no_SH_derep.qza file from the previous initial dereplicate command ran on the full length sequences (i.e. after the filter-seqs and edit-taxonomy commands):

qiime rescript dereplicate \
    --i-sequences sequences_filt.qza \
    --i-taxa taxonomy_no_SH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filt_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep.qza

This is why it it could not find 'SH1089862.09FU_UDB0271834_reps', as it was not present in the dereplicated taxonomy file post segment extraction, but is present in the initial derep file. This is why we reuse this file in the tutorial a few times.

I've corrected your pipeline, along with file name edits and pasted below. Everything works as intended. Give it a try and let me know if this works for you.

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose 

cd unite_rescript

# removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy taxonomy.qza \
    --i-sequences sequences.qza \
    --o-filtered-sequences sequences_filtered.qza

# removing the specific accessions as annotated within UNITE
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy taxonomy_no_SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex

# dereplicate filtered sequences
qiime rescript dereplicate \
    --i-sequences sequences_filtered.qza \
    --i-taxa taxonomy_no_SH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep.qza

# generate an initial reference pool by extracting the region of interest
# ITS86(F) - ITS4(R)
qiime feature-classifier extract-reads \
   --i-sequences sequences_filtered.qza \
   --p-f-primer GTGAATCATCGAATCTTTGAA \
   --p-r-primer TCCTCCGCTTATTGATATGC \
   --p-n-jobs 8 \
   --p-min-length 100 \
   --o-reads sequences_filtered_ext.qza \
   --verbose

# dereplicate sequences from our target extracted amplicon region
# Note the differeint taxonomy input and output files.
qiime rescript dereplicate \
    --i-sequences sequences_filtered_ext.qza \
    --i-taxa taxonomy_no_SH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_ext_derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_ext_derep.qza

# remove sequences with ambiguous IUPAC nts
qiime rescript cull-seqs \
    --i-sequences sequences_filtered_ext_derep.qza \
    --p-n-jobs 8 \
    --p-num-degenerates 2 \
    --p-homopolymer-length 8 \
    --o-clean-sequences sequences_filtered_ext_derep_cull.qza

# visualize
qiime feature-table tabulate-seqs \
  --i-data sequences_filtered_ext_derep_cull.qza \
  --o-visualization sequences_filtered_ext_derep_cull.qzv

# query our reference sequences against our primer extracted region
qiime rescript extract-seq-segments \
    --i-input-sequences sequences_filtered.qza \
    --i-reference-segment-sequences sequences_filtered_ext_derep_cull.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 100 \
    --p-threads 8 \
    --o-extracted-sequence-segments sequences_filtered_ext_derep_cull_i01.qza \
    --o-unmatched-sequences sequences_filtered_ext_derep_cull_i01unmatch.qza \
    --verbose

# filter again because with each iteration we will likely extract segments that are poor quality or contain ambiguous nts
qiime rescript dereplicate \
    --i-sequences sequences_filtered_ext_derep_cull_i01.qza \
    --i-taxa taxonomy_no_SH_derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_ext_derep_cull_i01derep.qza \
    --o-dereplicated-taxa taxonomy_no_SH_derep_i01derep.qza

-Cheers!

emmlemore · January 25, 2024, 5:23pm

Ah sorry about the file name mix up! Thanks so much for your help!!

SoilRotifer · January 25, 2024, 5:27pm

Not a problem at all. If I had a dollar for every time I did the same thing... I'd be rich!