RESCRIPt dereplicate uniq mode UnicodeEncodeError

Hi,

I'm working through the SILVA example for RESCRIPt (Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt) and am hitting an error at the dereplication step. I'm using QIIME 2 2022.11 and installed RESCRIPt via pip as suggested:

pip install git+https://github.com/bokulich-lab/RESCRIPt.git

Here are the commands I ran:

qiime rescript get-silva-data \
    --p-version '138.1' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza

qiime rescript cull-seqs \
    --i-sequences silva-138.1-ssu-nr99-seqs.qza \
    --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza 

qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138.1-ssu-nr99-tax.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza

The dereplicate command produced the following error:

$ cat /scratch/rlampe/30545035.tscc-mgr7.local/qiime2-q2cli-err-tij6a2zb.log
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_fulllength /scratch/rlampe/30545035.tscc-mgr7.local/qiime2/rlampe/data/dab83010-1036-49df-9f40-b75d7c4d9da7/data/dna-sequences.fasta --output /scratch/rlampe/30545035.tscc-mgr7.local/tmpcfgsrr46 --uc /scratch/rlampe/30545035.tscc-mgr7.local/tmp00vq71bj --xsize --threads 1

vsearch v2.22.1_linux_x86_64, 1007.2GB RAM, 64 cores
https://github.com/torognes/vsearch

Dereplicating file /scratch/rlampe/30545035.tscc-mgr7.local/qiime2/rlampe/data/dab83010-1036-49df-9f40-b75d7c4d9da7/data/dna-sequences.fasta 100%
699161179 nt in 477562 seqs, min 900, max 3983, avg 1464
Sorting 100%
435502 unique sequences, avg cluster 1.1, median 1, max 893
Writing FASTA output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/rescript/dereplicate.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/rescript/dereplicate.py:116: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  uc['centroidtaxa'] = uc['centroidID'].apply(lambda x: taxa.loc[x])
Traceback (most recent call last):
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-490>", line 2, in dereplicate
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 408, in _callable_executor_
    artifact = qiime2.sdk.Artifact._from_view(
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/result.py", line 356, in _from_view
    artifact._archiver = archive.Archiver.from_data(
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/core/archive/archiver.py", line 408, in from_data
    Format.write(rec, type, format, data_initializer,
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/core/archive/format/v5.py", line 20, in write
    super().write(archive_record, type, format, data_initializer,
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/core/archive/format/v1.py", line 25, in write
    provenance_capture.finalize(
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/core/archive/provenance.py", line 320, in finalize
    self.write_citations_bib()
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/core/archive/provenance.py", line 311, in write_citations_bib
    self.citations.save(str(self.path / self.CITATION_FILE))
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/core/cite.py", line 71, in save
    bp.dump(db, f, writer=writer)
  File "/projects/ps-allenlab/rlampe/bin/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/bibtexparser/__init__.py", line 108, in dump
    bibtex_file.write(writer.write(bib_database))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0161' in position 2884: ordinal not in range(256)

Thanks in advance!
Rob

Hi @rhlampe ,

This error indicates that you need to switch to a different locale encoding (such as UTF-8): the default you are using, latin-1, cannot encode some special characters. Here the offending character is in the citations that QIIME 2 writes alongside the output, so this has nothing to do with RESCRIPt itself. See this topic for troubleshooting steps:

(you can also search the forum for "codec can't encode character" to see quite a few similar topics)
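For reference, a minimal sketch of the usual workaround for this kind of error, which is to switch the shell to a UTF-8 locale before re-running the command. The locale name `en_US.UTF-8` is an assumption and may not be installed on your system, so check `locale -a` and substitute one that is:

```shell
# Print the current locale settings. A non-UTF-8 value (for example one
# ending in ISO-8859-1) would be consistent with the 'latin-1' error.
locale

# Assumption: the en_US.UTF-8 locale is installed on this machine;
# run `locale -a` to see what is available and substitute accordingly.
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

# Confirm which encoding Python now uses by default for text files.
python3 -c "import locale; print(locale.getpreferredencoding())"
```

If the `qiime` commands run inside a batch job, the `export` lines most likely need to go in the job script itself, before the `qiime rescript dereplicate` call, since a job may not inherit the settings of your login shell.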

Good luck!


Thanks so much! Fixed it following that thread.
