Rescript get-ncbi-data MemoryError

Hello,

I'm trying to run rescript get-ncbi-data on a cluster to download an arthropod reference database. When I run this command:

qiime rescript get-ncbi-data \
  --p-query 'txid6656[ORGN] AND species[SCN] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]' \
  --p-n-jobs 8 \
  --o-sequences NCBI_Arthropoda/ncbi-refseqs-unfiltered.qza \
  --o-taxonomy NCBI_Arthropoda/ncbi-refseqs-taxonomy-unfiltered.qza


It throws this error:

Plugin error from rescript:

A result has failed to un-serialize. Please ensure that the objects returned by the function are always picklable.


I've managed to download smaller databases on my own computer using this command; however, my computer doesn't seem capable of handling the larger databases, which is why I've moved to a cluster. The problem is that no matter which database I now try to download, it throws this error.

Any ideas as to why this might be happening?

Thanks!

Hi @klunn ,
This is a new error that we have not seen reported before, and it does not look like an error message from RESCRIPt itself. Most likely this is a message from joblib, the package that RESCRIPt uses in the background for parallelization.

I am curious: how long does the job run before it fails and reports this error message?

Could you please re-run the command, adding the --verbose option to print the full error message? (Alternatively, there should be a log file with the full error message; if that file still exists, just open it and share the full message here.)

The full error message will help us debug, but I can already tell that this is a joblib issue. I have a hunch that one or more jobs are failing (maybe because of an issue with retrieving a specific accession from the server) and turning up empty results, which causes the entire thing to crash. Once you replicate the error, please also try re-running the job with --p-n-jobs 1 to see if the problem resolves.
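
For example, something like this, reusing the query and output paths from your post, with only --p-n-jobs dropped to 1 and --verbose added:

qiime rescript get-ncbi-data \
  --p-query 'txid6656[ORGN] AND species[SCN] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]' \
  --p-n-jobs 1 \
  --verbose \
  --o-sequences NCBI_Arthropoda/ncbi-refseqs-unfiltered.qza \
  --o-taxonomy NCBI_Arthropoda/ncbi-refseqs-taxonomy-unfiltered.qza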

Thanks!


@Nicholas_Bokulich
It runs for about 5 minutes. Here are the contents of the error log file:

WARNING:2025-04-09 15:08:51,299:MainProcess:This query could result in more than 100 requests to NCBI. If you are not running it on the weekend or between 9 pm and 5 am Eastern Time weekdays, it may result in NCBI blocking your IP address. See Policies and Disclaimers - NCBI for details.
WARNING:2025-04-09 15:13:27,557:LokyProcess-4:Expected 5000 sequences in this chunk, but got 4999. I do not know why, or which sequences are missing.
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 661, in wait_result_broken_or_wakeup
result_item = result_reader.recv()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/multiprocessing/connection.py", line 421, in _recv_bytes
return self._recv(size)
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/multiprocessing/connection.py", line 386, in _recv
buf.write(chunk)
MemoryError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/q2cli/commands.py", line 530, in call
results = self._execute_action(
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/q2cli/commands.py", line 602, in _execute_action
results = action(**arguments)
File "", line 2, in get_ncbi_data
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/qiime2/sdk/action.py", line 299, in bound_callable
outputs = self.callable_executor(
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/qiime2/sdk/action.py", line 570, in callable_executor
output_views = self._callable(**view_args)
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/ncbi.py", line 83, in get_ncbi_data
seqs, taxa = _get_ncbi_data(query, accession_ids, ranks, rank_propagation,
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/ncbi.py", line 122, in _get_ncbi_data
seqs, taxids = get_data_for_query(
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/ncbi.py", line 397, in get_data_for_query
chunky = parallel(delayed(_get_query_chunk)(chunk, params, entrez_delay,
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 2007, in call
return output if self.return_generator else list(output)
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 1650, in _get_outputs
yield from self._retrieve()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 1754, in _retrieve
self._raise_error_fast()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
error_job.get_result(self.timeout)
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 745, in get_result
return self._return_or_raise()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 763, in _return_or_raise
raise self._result
joblib.externals.loky.process_executor.BrokenProcessPool: A result has failed to un-serialize. Please ensure that the objects returned by the function are always picklable.

Hi @klunn ,

Thanks for this info. So a few things are going on here.

As I suspected, the issue is (in part) with parallelization of the requests: the query is so large that individual jobs are failing to return all of the requested data. First, notice this warning near the top of the log:

  Expected 5000 sequences in this chunk, but got 4999. I do not know why, or which sequences are missing.

and then notice this key part of the traceback:

  MemoryError

You are requesting too much data all at once, so at least one job is failing with a memory error, failing to return its data chunk, and the whole thing collapses.

So, in a sense, the issue is that your query is far too large. I discussed this on the sidelines with @SoilRotifer, who pointed out that your search query may be incorrect:

First, there is no target gene specified, so you are essentially grabbing anything annotated with the given taxonomy ID.

Second, species[SCN] is incorrect.

  • if you copy and paste the query into the GenBank nucleotide search, you'll get `Unknown field was ignored: [SCN]`.
  • I am guessing that you want only sequences that are annotated at the species level? I'm not sure that is possible via an Entrez query... you could of course filter to that level with RESCRIPt once you have the data, but you need a manageable query first.
  • The search is basically returning results for everything else in your query, i.e., txid6656[ORGN] AND species NOT ... (without the [SCN] term), which returns ~4,274,299 records.
  • There are some records with unofficial SCN labels as part of the species labels... but I would not know how to only search for those...
  • If AND species[SCN] is not used at all, then 128,681,994 records are returned!

So it seems like you are trying to pull down 128,681,994 records, which is causing the whole job to crash and producing the warnings above.

The solution here is to (1) refine your query to a smaller target set; starting with a target gene would help a lot, as right now you are pulling down everything; and (2) try reducing --p-n-jobs if you keep getting this same error message, as parallelization will increase the memory burden (though I am a bit surprised it would increase it that much).
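
One practical tip as you refine the query: you can check how many records a candidate query matches before starting any download by calling NCBI's standard E-utilities esearch endpoint directly. This is independent of RESCRIPt, and the term below is just a shortened example; substitute your full query string:

# The first <Count> element in the returned XML is the total number of matching records
curl -s -G 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' \
  --data-urlencode 'db=nuccore' \
  --data-urlencode 'retmax=0' \
  --data-urlencode 'term=txid6656[ORGN] AND cytochrome c oxidase subunit 1[Title]' \
  | grep -o '<Count>[0-9]*</Count>' | head -n 1

If that count comes back in the millions, the download is very likely to hit the same memory problems.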

Thank you for this; it helped me understand some potential issues. I've changed my command to the following, taking on board your suggestions to name the target region and to reduce the number of jobs (I've tried running it with 1 to 8 jobs), but I'm still running into the same error:

qiime rescript get-ncbi-data \
  --p-query 'txid6656[ORGN] AND (cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]' \
  --p-n-jobs 2 \
  --o-sequences NCBI_Arthropoda/ncbi-refseqs-unfiltered.qza \
  --o-taxonomy NCBI_Arthropoda/ncbi-refseqs-taxonomy-unfiltered.qza


Error from running this:

WARNING:2025-04-12 13:48:11,243:MainProcess:This query could result in more than 100 requests to NCBI. If you are not running it on the weekend or between 9 pm and 5 am Eastern Time weekdays, it may result in NCBI blocking your IP address. See Policies and Disclaimers - NCBI for details.
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 661, in wait_result_broken_or_wakeup
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/multiprocessing/connection.py", line 421, in _recv_bytes
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/multiprocessing/connection.py", line 386, in _recv
MemoryError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/q2cli/commands.py", line 530, in call
results = self._execute_action(
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/q2cli/commands.py", line 602, in _execute_action
results = action(**arguments)
File "", line 2, in get_ncbi_data
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/qiime2/sdk/action.py", line 299, in bound_callable
outputs = self.callable_executor(
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/qiime2/sdk/action.py", line 570, in callable_executor
output_views = self._callable(**view_args)
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/ncbi.py", line 83, in get_ncbi_data
seqs, taxa = _get_ncbi_data(query, accession_ids, ranks, rank_propagation,
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/ncbi.py", line 122, in _get_ncbi_data
seqs, taxids = get_data_for_query(
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/ncbi.py", line 397, in get_data_for_query
chunky = parallel(delayed(_get_query_chunk)(chunk, params, entrez_delay,
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 2007, in call
return output if self.return_generator else list(output)
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 1650, in _get_outputs
yield from self._retrieve()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 1754, in _retrieve
self._raise_error_fast()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
error_job.get_result(self.timeout)
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 745, in get_result
return self._return_or_raise()
File "/home/klunn94/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/joblib/parallel.py", line 763, in _return_or_raise
raise self._result
joblib.externals.loky.process_executor.BrokenProcessPool: A result has failed to un-serialize. Please ensure that the objects returned by the function are always picklable.

Do you have any further suggestions?

Hi @klunn ,
I just checked your query on GenBank, and it's still returning 3.5 million accessions...

So even after refining it, I think the query is still too large, and the jobs are failing because they run out of memory.

I suggest trying a single job (though you might still get the same error, or just a new error message, as this seems to be an issue with memory capacity rather than with parallelization per se).

The only solution may be to refine your query further. You could split it into batches by using a more specific txid, e.g., querying one class at a time instead of the entire phylum.
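
For example, something along these lines, run once per class. The taxid shown is what I believe is Insecta, but please double-check any class-level taxids in the NCBI Taxonomy browser before running; the output filenames are just placeholders:

# Gene and exclusion terms reused from your COI query above
GENE_AND_FILTERS='(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]'

# One class at a time, e.g. Insecta (txid50557); repeat with the other class-level taxids
qiime rescript get-ncbi-data \
  --p-query "txid50557[ORGN] AND $GENE_AND_FILTERS" \
  --p-n-jobs 1 \
  --o-sequences NCBI_Arthropoda/ncbi-insecta-seqs-unfiltered.qza \
  --o-taxonomy NCBI_Arthropoda/ncbi-insecta-taxonomy-unfiltered.qza

If I recall correctly, the per-class results can then be combined downstream (RESCRIPt has a merge-taxa action for the taxonomies, and q2-feature-table has merge-seqs for the sequences), but double-check those action names in your QIIME 2 version.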

You could also check your memory use and job limits on the cluster, and discuss with your HPC admin to see if there is any remedy there.
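
For example, if your cluster happens to use SLURM (an assumption on my part; the directives differ for other schedulers, and the values below are only placeholders to adjust with your HPC admin), requesting more memory explicitly in the job script may help:

#!/bin/bash
#SBATCH --job-name=rescript-ncbi     # hypothetical job name
#SBATCH --cpus-per-task=1            # matching --p-n-jobs 1
#SBATCH --mem=64G                    # ask for more memory than the cluster default
#SBATCH --time=48:00:00              # downloads of this size can run for a long time

# your qiime rescript get-ncbi-data command goes here

After a failed run, something like sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,State should also show whether the job actually hit its memory limit.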

I hope that helps!
