RESCRIPT error: Download did not finish. Reason unknown

avtober · March 21, 2021, 7:51am

Hi, I tried to use this method to get NCBI data for LSU 28S region and I used the following command line:

qiime rescript get-ncbi-data \
	--p-query '(LSU[TITLE] OR 28S[TITLE] or large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] )' \
	--o-sequences ncbi-LSU-seqs-unfiltered.qza \
	--o-taxonomy ncbi-LSU-taxonomy-unfiltered.qza

Looking on NCBI Blast, this should be around 700,000 sequences. When I ran the command, I got this error message:

Plugin error from rescript:

Download did not finish. Reason unknown.

Debug info has been saved to /tmp/qiime2-q2cli-err-8740mbw3.log

Do you know what this might mean?

Thanks, A

Nicholas_Bokulich · March 21, 2021, 10:36am

No, but it is probably just a server-side issue and might resolve if you try again later.

For context, that error message comes from RESCRIPt. It attempts to diagnose the problem and automatically retries a few times, but sometimes the download simply does not work, often due to transient issues connecting to the database.
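If the failure really is transient, a small wrapper that reruns the command after a pause may save some manual retyping. The sketch below is only an illustration: `retry` is a hypothetical helper, not part of RESCRIPt or QIIME 2, and the attempt count and delays are arbitrary assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RESCRIPt): rerun a command until it
# succeeds or the attempt limit is reached, doubling the wait each time.
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}

# Example (commented out so the sketch has no side effects):
# up to 4 attempts, waiting 10 minutes before the first retry.
# retry 4 600 qiime rescript get-ncbi-data \
#     --p-query 'txid10190[ORGN] AND (LSU[TITLE] OR 28S[TITLE])' \
#     --o-sequences seqs.qza --o-taxonomy taxonomy.qza
```

Long pauses between attempts are deliberate here, since hammering the NCBI servers with immediate retries is unlikely to help.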

avtober · March 23, 2021, 9:50am

Hi Nicholas,

Thanks for your reply. I have tried a few times and still no luck. I just realised that I want the sequences from the Nucleotide database but have not specified this in the command, so I assume it will be trying to download from all the databases. How would I specify to look in just the Nucleotide database?

Thanks
A

SoilRotifer · March 23, 2021, 2:13pm

If you read the help documentation by entering the command:

qiime rescript get-ncbi-data --help

You'll see that the help text for both --p-query and --m-accession-ids-file explicitly states that only data from the Nucleotide database will be queried and downloaded. This may change in future updates, in which case that help text will be updated.

SoilRotifer · March 23, 2021, 2:18pm

@avtober, I forgot to mention that you should be aware that the help text also has other helpful information, e.g.:

Please be aware of the NCBI Disclaimer and Copyright notice
(Policies and Disclaimers - NCBI), particularly "run
retrieval scripts on weekends or between 9 pm and 5 am Eastern Time
weekdays for any series of more than 100 requests". As a rough guide, if
you are downloading more than 125,000 sequences, only run this method at
those times...

This could also affect your ability to download data.
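As a rough way to act on that guidance, a script could check the current US Eastern time before starting a large download. This is a minimal sketch: `is_off_peak` is a hypothetical helper, and it assumes the system has the `America/New_York` time zone data installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given day and hour fall in the
# window quoted in the help text (weekends, or 9 pm to 5 am on weekdays).
#   DAY:  1 (Monday) .. 7 (Sunday), as printed by `date +%u`
#   HOUR: 00 .. 23 in US Eastern Time, as printed by `date +%H`
is_off_peak() {
    # 10# forces base 10 so that "08" and "09" are not read as octal.
    local day=$((10#$1)) hour=$((10#$2))
    # Weekends are off-peak all day.
    [ "$day" -ge 6 ] && return 0
    # Weekdays: from 21:00 until 05:00.
    [ "$hour" -ge 21 ] || [ "$hour" -lt 5 ]
}

if is_off_peak "$(TZ=America/New_York date +%u)" "$(TZ=America/New_York date +%H)"; then
    echo "Off-peak at NCBI: OK to start a large download."
else
    echo "Peak hours at NCBI: consider waiting."
fi
```

The day and hour are passed in as arguments so the rule itself can be checked without depending on the clock.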

One other thought, download your data in chunks as outlined here:

by querying separate taxonomic groups. As an example, I ran the following command to download only Rotifera sequences and it worked:

$ qiime rescript get-ncbi-data \
	--p-query 'txid10190[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE])' \
	--o-sequences ncbi-LSU-rotifera-seqs-unfiltered.qza \
	--o-taxonomy ncbi-LSU-rotifera-taxonomy-unfiltered.qza \
	--verbose

Saved FeatureData[Sequence] to: ncbi-LSU-rotifera-seqs-unfiltered.qza
Saved FeatureData[Taxonomy] to: ncbi-LSU-rotifera-taxonomy-unfiltered.qza

Since you've listed 28S as your LSU of interest, I assumed you only wanted to download data from within the Eukaryota (23S is the LSU for Bacteria and Archaea, which you did not list). Below is the command for downloading Eukaryote LSU sequences. Note that this may still be too large a query, so I'd suggest downloading in chunks as mentioned above.

qiime rescript get-ncbi-data \
	--p-query 'txid2759[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE])' \
	--o-sequences ncbi-LSU-eukaryota-seqs-unfiltered.qza \
	--o-taxonomy ncbi-LSU-eukaryota-taxonomy-unfiltered.qza \
	--verbose

You can search the NCBI Taxonomy page to figure out what the txid for a given group is.
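Once you have one or more txids, the query string can be assembled by a small function rather than edited by hand. The sketch below is an illustration only: `lsu_query` is a hypothetical helper. It wraps the txid terms in their own parentheses so the grouping does not depend on how Entrez orders Boolean operators (as far as I know it evaluates them left to right), and it writes the operators in uppercase, which is how NCBI's search help presents them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build one Entrez query for any number of txids.
# The title terms are the ones used in the commands in this thread.
lsu_query() {
    local taxa="" txid
    for txid in "$@"; do
        # Join the txid terms with OR; the first term gets no prefix.
        taxa="${taxa:+$taxa OR }txid${txid}[ORGN]"
    done
    local title='LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE]'
    local exclude='NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE]'
    printf '(%s) AND (%s %s)\n' "$taxa" "$title" "$exclude"
}

# Print the query for a single group, then for two groups combined.
lsu_query 10190
lsu_query 554915 2686027
```

The printed string can then be passed to --p-query, for example with `--p-query "$(lsu_query 10190)"`.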

Finally, you can simply use RESCRIPt to download the LSU data from SILVA (an update was recently pushed to the GitHub code to do this for SILVA ver 138.1). Then you can run the following:

qiime rescript get-silva-data \
	--p-version '138.1' \
	--p-target 'LSURef_NR99' \
	--p-include-species-labels \
	--p-ranks domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus \
	--p-rank-propagation \
	--output-dir silva-138.1-LSU

Note, I listed all available taxonomic ranks to be parsed, as I am not sure which would be most helpful for you in this case. You can remove the ranks that you do not need. Any empty ranks will be filled in by the nearest upper-level rank.
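To make the idea of rank propagation concrete, here is a toy illustration of filling empty ranks from the nearest upper-level rank. It is not RESCRIPt's implementation: `propagate_ranks` is a hypothetical function, the lineage is a plain semicolon-separated string, and the real output format (for example rank prefixes) may well differ.

```shell
#!/usr/bin/env bash
# Toy illustration only: fill each empty rank in a semicolon-separated
# lineage with the closest non-empty rank above it.
propagate_ranks() {
    local IFS=';' last="" rank
    local -a ranks=() out=()
    # The extra ';' keeps a trailing empty rank from being dropped by read.
    read -r -a ranks <<< "$1;"
    for rank in "${ranks[@]}"; do
        [ -n "$rank" ] && last=$rank
        out+=("$last")
    done
    # "${out[*]}" joins the fields with the first character of IFS (';').
    echo "${out[*]}"
}

propagate_ranks 'Eukaryota;;Rotifera;;Bdelloidea'
```

In this example the two empty positions take the names of the ranks directly above them, which is why removing ranks you do not need keeps the resulting lineages shorter and less repetitive.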

-Cheers!

avtober · March 23, 2021, 5:30pm

Hi @SoilRotifer,

Thanks for that I will have a go at downloading in chunks and merging back together. I will let you know how it goes.

Best
Anya

avtober · March 23, 2021, 11:46pm

Hi again, so I just tried to download in chunks and first tried plants using the following script:

qiime rescript get-ncbi-data \
	--p-query 'txid33090[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])' \
	--p-n-jobs 5 \
	--o-sequences ncbi-LSU-seqs-plants.qza \
	--o-taxonomy ncbi-LSU-taxonomy-plants.qza

This worked well so I then tried to do the same for protists, for this I had to use a few different IDs and ran the following script:

qiime rescript get-ncbi-data \
	--p-query 'txid554915[ORGN] OR txid2686027[ORGN] OR txid554296[ORGN] OR txid1401294[ORGN] OR txid2608240[ORGN] OR txid3027[ORGN] OR txid2611352[ORGN] OR txid38254[ORGN] OR txid2608109[ORGN] OR txid2611341[ORGN] OR txid2763[ORGN] OR txid2698737[ORGN] OR txid2683617[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])' \
	--p-n-jobs 5 \
	--o-sequences ncbi-LSU-seqs-protists.qza \
	--o-taxonomy ncbi-LSU-taxonomy-protists.qza \
	--verbose

but this did not work and I got the following error message:

Traceback (most recent call last):
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
results = action(**arguments)
File "", line 2, in get_ncbi_data
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/rescript/ncbi.py", line 89, in get_ncbi_data
query, logging_level, n_jobs, request_lock, _entrez_delay)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/rescript/ncbi.py", line 371, in get_nuc_for_query
seqs[rec['TSeq_accver']] = rec['TSeq_sequence']
KeyError: 'TSeq_accver'

Plugin error from rescript:

'TSeq_accver'

See above for debug info.

Both of these were under 100,000 sequences so not too large. Any thoughts on what the problem might be?

Thanks
A

SoilRotifer · March 24, 2021, 2:53am

Hi @avtober,

When this happens, I usually suspect that a few specific txid queries are failing. So I used this simple bash script to run the qiime rescript get-ncbi-data command for each txid separately.

Save this to a file, say get-txids.sh, then run it by typing bash get-txids.sh

txid_list="554915 2686027 554296 1401294 2608240 3027 2611352 38254 2608109 2611341 2763 2698737 2683617"

# Run one query per txid so that a failing taxon is easy to identify.
for taxa in $txid_list
do
	echo "Processing: $taxa"
	qiime rescript get-ncbi-data \
		--p-query "txid${taxa}[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])" \
		--p-n-jobs 4 \
		--o-sequences "ncbi-LSU-seqs-txid${taxa}.qza" \
		--o-taxonomy "ncbi-LSU-taxonomy-txid${taxa}.qza"
done

From here I found that two txids failed:

2611352
2698737

I am not sure why, but when I paste the search strings (below) here, I do get a few results. I've not dug deeply into this... but at least we could narrow it down.

txid2611352[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])
txid2698737[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])

Any ideas @BenKaehler?

avtober · March 24, 2021, 2:54pm

Thanks for looking into that. I took out those two IDs and the download worked. I have also been downloading the rest of Eukaryota in small batches of fewer than 120,000 sequences; some have worked and some still fail with the error message 'Download did not finish. Reason unknown.' Whether a batch downloads seems very random and doesn't seem to depend on size. Hopefully if I keep trying they will eventually all work.

Best
Anya
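When batches fail seemingly at random, it can help to record which ones failed so that only those are rerun later. The following is a sketch under assumptions: `run_batches` and `download_txid` are hypothetical names, and the query and flags are simply the ones already used in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the per-txid command used in this thread.
download_txid() {
    local txid=$1
    qiime rescript get-ncbi-data \
        --p-query "txid${txid}[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])" \
        --o-sequences "ncbi-LSU-seqs-txid${txid}.qza" \
        --o-taxonomy "ncbi-LSU-taxonomy-txid${txid}.qza"
}

# run_batches FAILED_FILE RUNNER ID...
# Calls RUNNER once per ID and writes every ID that failed to FAILED_FILE.
run_batches() {
    local failed_file=$1 runner=$2 id
    shift 2
    : > "$failed_file"    # start with an empty list of failures
    for id in "$@"; do
        if "$runner" "$id"; then
            echo "OK: $id"
        else
            echo "FAILED: $id" >&2
            echo "$id" >> "$failed_file"
        fi
    done
}

# Example (commented out so the sketch has no side effects):
# run_batches failed-txids.txt download_txid 554915 2686027 554296
# Later, retry only the batches that failed the first time:
# run_batches still-failing.txt download_txid $(cat failed-txids.txt)
```

A batch that fails on every retry is more likely to be hitting a record-format problem like the KeyError above than a transient connection issue.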

BenKaehler · March 24, 2021, 6:46pm

Thanks @avtober, @Nicholas_Bokulich, and @SoilRotifer for already boiling this problem down to a relatively small query.

From those errors (particularly the KeyError) it looks like we’re getting a record back that has a format we haven’t seen before.

I’ll have to debug, so it might take a day or two for me to get back to you.

avtober · March 25, 2021, 8:51am

Thanks @BenKaehler and @SoilRotifer for your help with this. I just have a couple more downloads that are still not working. The first is under 100,000 sequences and I used the following script:

qiime rescript get-ncbi-data \
	--p-query 'txid189478[ORGN] OR txid147549[ORGN] OR txid205932[ORGN] OR txid147537[ORGN] OR txid451866[ORGN] OR txid129384[ORGN] OR txid136265[ORGN] OR txid2283618[ORGN] OR txid112252[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])' \
	--o-sequences ncbi-LSU-seqs-fungi4.qza \
	--o-taxonomy ncbi-LSU-taxonomy-fungi4.qza \
	--verbose

When I try to run this, I get the following error message:

Traceback (most recent call last):
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
results = action(**arguments)
File "", line 2, in get_ncbi_data
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/rescript/ncbi.py", line 89, in get_ncbi_data
query, logging_level, n_jobs, request_lock, _entrez_delay)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/rescript/ncbi.py", line 365, in get_nuc_for_query
for chunk in range(0, expected_num_records, 5000))
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/parallel.py", line 1044, in __call__
while self.dispatch_one_batch(iterator):
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
self._dispatch(tasks)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/parallel.py", line 777, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 572, in __init__
self.results = batch()
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/parallel.py", line 263, in __call__
for func, args, kwargs in self.items]
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/parallel.py", line 263, in <listcomp>
for func, args, kwargs in self.items]
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/rescript/ncbi.py", line 342, in _get_query_chunk
raise RuntimeError('Download did not finish. Reason unknown.')
RuntimeError: Download did not finish. Reason unknown.

Plugin error from rescript:

Download did not finish. Reason unknown.

See above for debug info.

The second download is larger at 123,000 sequences and I used the following script:

qiime rescript get-ncbi-data \
	--p-query 'txid6960[ORGN] AND (LSU[TITLE] OR 28S[TITLE] OR large ribosomal subunit[TITLE] NOT uncultured[TITLE] NOT unidentified[TITLE] NOT unclassified[TITLE] NOT unverified[TITLE])' \
	--p-n-jobs 5 \
	--o-sequences ncbi-LSU-seqs-metazoa1.qza \
	--o-taxonomy ncbi-LSU-taxonomy-metazoa1.qza \
	--verbose

For this one it looks like I get a slightly different error message:

Traceback (most recent call last):
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
results = action(**arguments)
File "", line 2, in get_ncbi_data
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/rescript/ncbi.py", line 89, in get_ncbi_data
query, logging_level, n_jobs, request_lock, _entrez_delay)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/rescript/ncbi.py", line 365, in get_nuc_for_query
for chunk in range(0, expected_num_records, 5000))
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/mnt/scratch/nodelete/sbiat4/qiime_conda/envs/rescript/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
RuntimeError: Download did not finish. Reason unknown.

Plugin error from rescript:

Download did not finish. Reason unknown.

Any thoughts on this would be greatly appreciated. I am so close to having the whole database downloaded!

Anya

avtober · April 6, 2021, 10:39am

Hi @BenKaehler, did you manage to get anywhere with the KeyError formatting issue?

Many thanks
Anya

Nicholas_Bokulich · April 7, 2021, 6:03am

Hi @avtober I spoke with @BenKaehler about this earlier this week and he is still investigating. Thanks for your patience!

I believe NCBI RefSeqs has an LSU reference set... it would be smaller and pre-curated, so might be a good start that should work with a simpler query (basically just the project ID you can get from the NCBI RefSeqs website, see the RESCRIPt tutorial for the 16S SSU RefSeqs example). Just an idea that might get you moving while Ben investigates this issue...

avtober · April 7, 2021, 8:36am

Hi @Nicholas_Bokulich, thanks for that suggestion. I did have a look, but it seems the LSU RefSeqs are only for fungi, and I am really looking for eukaryotes (parasites in snails), though I would also like to include fungi and bacteria just in case. Thank you both for all your help with this. I am happy to wait and work on other things for now.

Best
Anya

SoilRotifer · April 7, 2021, 1:38pm

In that case, why not use the LSU from SILVA? See here:

You can use qiime rescript get-silva-data to download the latest v138.1 LSU database. Just set --p-version 138.1 and --p-target LSURef_NR99 or --p-target LSURef. You can follow along in the above-linked tutorial to perform further filtering and curation of the reference database.

avtober · April 7, 2021, 5:13pm

Hi @SoilRotifer, I have already tried the SILVA LSU NR99 and full database which did work, however both databases do not have all the parasite species that I am looking for. The NCBI database has a lot more parasite sequences for 28S. If I cannot get the whole NCBI database my other thought was to just supplement the SILVA databases with some of the parasite sequences from NCBI.

I am not sure exactly how to do this yet, I guess I would just download the individual sequences and copy them into the SILVA database. Is there a tutorial on this somewhere?

Best
Anya

SoilRotifer · April 7, 2021, 5:25pm

Add all of your sequences into a single FASTA file, and the associated taxonomy into another file. Then you can simply import your sequence and taxonomy data into QIIME and merge these with the SILVA LSU qza files, as you would have done when downloading separate chunks of data from GenBank:
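The import-then-merge steps described above might look roughly like the sketch below. All file names are placeholders, `run` and `import_and_merge` are hypothetical helpers, and the sketch assumes that the taxonomy file is a two-column, headerless TSV and that the SILVA sequences are already a DNA FeatureData[Sequence] artifact. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# import_and_merge FASTA TAXONOMY_TSV SILVA_SEQS_QZA SILVA_TAXONOMY_QZA
import_and_merge() {
    local fasta=$1 tax_tsv=$2 silva_seqs=$3 silva_tax=$4
    # Import the extra sequences and their taxonomy as QIIME 2 artifacts.
    run qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path "$fasta" \
        --output-path extra-seqs.qza &&
    run qiime tools import \
        --type 'FeatureData[Taxonomy]' \
        --input-format HeaderlessTSVTaxonomyFormat \
        --input-path "$tax_tsv" \
        --output-path extra-taxonomy.qza &&
    # Merge the new artifacts with the SILVA ones.
    run qiime feature-table merge-seqs \
        --i-data "$silva_seqs" \
        --i-data extra-seqs.qza \
        --o-merged-data merged-seqs.qza &&
    run qiime feature-table merge-taxa \
        --i-data "$silva_tax" \
        --i-data extra-taxonomy.qza \
        --o-merged-data merged-taxonomy.qza
}

import_and_merge extra-seqs.fasta extra-taxonomy.tsv silva-seqs.qza silva-taxonomy.qza
```

The sequence IDs in the FASTA file need to match the IDs in the taxonomy file, and the lineage strings should follow the same rank layout as the SILVA taxonomy you are merging into.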

-Mike

Nicholas_Bokulich · April 7, 2021, 6:17pm

It's a good thought, but you would need to make sure that the taxonomies align (e.g., use the same lineage names and conventions). If it's a matter of adding a few species, it is probably not a big deal to manually do this. But it will be challenging if you have a large number... if so, it will probably be better (less time/effort!) to wait than to manually stitch these together.

Hopefully @BenKaehler will be able to track down why NCBI is hanging up in this case.

BenKaehler · April 8, 2021, 8:16am

Hi everyone, sorry for the slow turnaround on this one.

I have tweaked get-ncbi-data so that it now accommodates the NCBI weirdness that you found.

Once this PR is merged you should be good to go.