RESCRIPt get-ncbi-data timeout error

devonorourke · August 2, 2020, 8:45pm

@Nicholas_Bokulich @BenKaehler,
I'm getting the following error when running qiime rescript get-ncbi-data:

Plugin error from rescript:

  504 Server Error: Gateway Timeout for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi

Any ideas on how to troubleshoot? Initially, I tried running with default parameters:

qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND "BARCODE"[KYWD])' \
--o-sequences ncbi_boldonly_alltaxa_seqs.qza \
--o-taxonomy ncbi_boldonly_alltaxa_taxa.qza

I thought that maybe I needed to adjust the --p-entrez-delay parameter to a larger integer value, so I reran the code with --p-entrez-delay 0.9 and got the exact same error.

I'm curious - what was the largest number of sequences you queried with this new function? The timeout seems to happen about 2.5 hours after the job launches (that is, it doesn't happen right off the bat). Any concern that this timeout is caused not so much by the 'number of sequences per unit time' (a rate problem) as by the sheer 'length of time over which requests are being made' (a duration problem)?

thanks!

BenKaehler · August 3, 2020, 12:42am

Thanks @devonorourke, when I tried it it did the same thing - downloaded for a couple of hours then threw that error.

I doubt that changing the entrez delay will help. I can't remember the error that you get when you make too many requests per second but I don't think it was that one.

I can tell from the logs that it is downloading the sequences successfully but fails when it asks for the taxonomies. I will try chunking up the taxonomy requests to see if it helps. It is a server side issue, though, so it might be tricky to debug.
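A minimal sketch of what chunking the taxonomy requests could look like, assuming the taxonomy IDs are already sitting in a list. The function name, the chunk size and the example IDs are illustrative assumptions, not RESCRIPt's actual code.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Hypothetical taxids; each batch would go into its own, smaller request
taxids = [str(n) for n in range(1, 12)]
batches = list(chunked(taxids, 5))
```

Smaller batches mean more requests, so the batch size trades off the size of each request against the total number sent to NCBI.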

In answer to your question, my record for the largest query so far was 189,297 sequences. (Thanks @SoilRotifer.)

devonorourke · August 3, 2020, 12:48am

Not sure if it's of any use to you, but another group used a handful of Perl scripts to mine GenBank data a few years ago. Maybe something useful in those scripts?

See here

BenKaehler · August 5, 2020, 7:43am

Thanks @devonorourke, I am fortunate that I have made our scripts work without having to resort to reading Perl. (I didn't see your post until just now.)

The new record for the number of sequences downloaded in a single call to get-ncbi-data is now 985,413.

I will work on tidying up the changes I have made and we will release a new version as soon as we can.

devonorourke · August 10, 2020, 3:49pm

Still not working...
This error was generated around noon EST today, and I got started around 7am. Maybe Ben's advice on running this only overnight or on weekends is the only way to go?

Plugin error from rescript:

  HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /entrez/eutils/epost.fcgi?tool=qiime2-rescript&email=b.kaehler%40adfa.edu.au (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f834403aac8>: Failed to establish a new connection: [Errno 101] Network is unreachable',))

Debug info has been saved to /tmp/qiime2-q2cli-err-tou0t564.log

devonorourke · August 10, 2020, 3:51pm

Related - I tried pulling the references from NCBI that do not have the barcode keyword. It also failed today, but generated a slightly different error message. Apparently it's nighttime downloads only for me.

qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT "BARCODE"[KYWD])' \
--output-dir NCBIdata_notBOLD

Plugin error from rescript:

  'Response' object has no attribute 'code'

The log file itself showed this:

WARNING:root:This query could result in more than 100 requests to NCBI. If you are not running it on the weekend or between 9 pm and 5 am Eastern Time weekdays, it may result in NCBI blocking your IP address. See https://www.ncbi.nlm.nih.gov/home/about/policies/ for details.
Traceback (most recent call last):
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/rescript/ncbi.py", line 150, in _efetch_5000
    r.raise_for_status()
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&retmode=xml&WebEnv=NCID_1_47245248_130.14.18.97_9001_1597060157_1111128562_0MetA0_S_MegaStore&query_key=1&tool=qiime2-rescript&email=b.kaehler%40adfa.edu.au&retstart=960000&retmax=5000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
    results = action(**arguments)
  File "<decorator-gen-179>", line 2, in get_ncbi_data
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/rescript/ncbi.py", line 77, in get_ncbi_data
    seqs, taxids = get_nuc_for_query(query, entrez_delay)
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/rescript/ncbi.py", line 256, in get_nuc_for_query
    data_chunk = _efetch_5000(params, entrez_delay)
  File "/home/dorourke/miniconda/envs/benrescript/lib/python3.6/site-packages/rescript/ncbi.py", line 167, in _efetch_5000
    if e.response.code == 400:  # because of missing ids
AttributeError: 'Response' object has no attribute 'code'
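For anyone who hits the same AttributeError: in the requests library a Response stores the HTTP status in `status_code` and has no `code` attribute, whereas the standard library's `urllib.error.HTTPError` does use `code`. The helper below is a hypothetical sketch of reading the status from either kind of exception. It is not RESCRIPt's code, and the requests-style exception is imitated with a stand-in class so the example runs without third-party packages.

```python
import io
from types import SimpleNamespace
from urllib.error import HTTPError as UrllibHTTPError


def http_status(exc):
    """Return the HTTP status carried by an exception, or None if absent."""
    response = getattr(exc, "response", None)
    if response is not None:
        # requests.Response exposes the status as `status_code`, not `code`
        return getattr(response, "status_code", None)
    # urllib.error.HTTPError stores the status directly as `code`
    return getattr(exc, "code", None)


class FakeRequestsError(Exception):
    """Stand-in for requests.exceptions.HTTPError (illustration only)."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.response = SimpleNamespace(status_code=status)


requests_style = FakeRequestsError(400)
urllib_style = UrllibHTTPError(
    "https://example.invalid", 400, "Bad Request", None, io.BytesIO(b"")
)
```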

BenKaehler · August 10, 2020, 11:37pm

Thanks @devonorourke, I pushed up a fix for that shortly after you made this post (@SoilRotifer found the same issue at around the same time). Hopefully it should run now. I’m just waiting on 9pm Eastern Time to give it a go.

[Correction: I didn’t notice the error in your second post. That’s a new one. I have just pushed a fix for that, too.]

I have adopted an iterative approach that will hopefully converge on a robust download system: as we find more ways that downloading from NCBI can fail, I build in contingencies to handle the failures.
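One common contingency of this kind is retrying transient server errors (such as the 504 earlier in this thread) with an increasing delay. The sketch below shows the general pattern only. `TransientError`, `with_retries`, the list of retryable statuses and the delays are all assumptions made for illustration and do not describe what the plugin does.

```python
import time

# Assumed set of statuses worth retrying; adjust to taste
RETRYABLE = frozenset({429, 500, 502, 503, 504})


class TransientError(Exception):
    """Hypothetical exception a fetcher raises when a request fails."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_retries(call, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying retryable failures with exponential backoff."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(max_tries):
        try:
            return call()
        except TransientError as err:
            # Give up on non-retryable statuses or when out of attempts
            if err.status not in RETRYABLE or attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: fail twice with a 504, then succeed; record delays instead of sleeping
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError(504)
    return "ok"


waits = []
result = with_retries(flaky, sleep=waits.append)
```

Passing `sleep` in as a parameter keeps the demo instant; in real use the default `time.sleep` applies the delays.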

Just as a correction: running this overnight or on weekends isn’t my advice, it is an NCBI requirement for large queries. If we ignore it, they will hassle me personally first and then potentially block your IP if they are not happy with our behaviour. See the bottom of Policies and Disclaimers - NCBI for details.
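The window quoted in the log warning earlier in the thread (weekends, or 9 pm to 5 am Eastern Time on weekdays) is easy to check before launching a big query. This is only a sketch: `is_off_peak` is a hypothetical helper, and it assumes the datetime it receives is already expressed in US Eastern Time.

```python
from datetime import datetime


def is_off_peak(eastern: datetime) -> bool:
    """True if `eastern` (already in US Eastern Time) falls on a weekend
    or between 9 pm and 5 am on a weekday."""
    if eastern.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return eastern.hour >= 21 or eastern.hour < 5


# Two launch times mentioned in this thread (10 August 2020 was a Monday)
noon_monday = is_off_peak(datetime(2020, 8, 10, 12, 0))
evening_monday = is_off_peak(datetime(2020, 8, 10, 21, 2))
```

On Python 3.9 or later, the current Eastern time can be obtained with `datetime.now(zoneinfo.ZoneInfo("America/New_York"))`, provided time zone data is installed on the machine.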

Note that for our purposes, a “large” query is one for more than around 125k sequences (because that would result in the plugin sending more than around 100 HTTP requests to NCBI).

devonorourke · August 11, 2020, 11:06am

I installed your repo at 7pm last night in a new Conda environment and ran the script at 9:02pm EST. This morning the same timeout error message had popped up.

Unclear now whether or not this is possible to do in a single go. I can see this working in two directions (maybe there are many others?):

1. The user breaks their largest clade into its various subtaxa and downloads each one individually. It might take a lot of trial and error, and I think this strategy can lead to a lot of variation among users' databases simply because someone forgets one subtaxon or another. Nevertheless, this is exactly what I have to do with my R script to pull down all COI from BOLD. I have to split it up into:

All chordates
All insects except beetles, flies, moths, and wasps
All other insects
All other arthropods
All other animals not chordates or arthropods

Maybe NCBI would manage downloading something like that breakdown in time between 9pm-5am (or perhaps over the weekend)? I can see how we'd get the taxID for chordates and arthropods, but I wonder if we can set up a query that includes one larger group's taxID (Arthropoda) while excluding four others (Coleoptera, Diptera, Lepidoptera, Hymenoptera).

2. Maybe the get-ncbi-data function could write to disk continuously, so that when the program fails you have at least kept whatever was downloaded up to that point. I could envision that even if I get a timeout error after running it overnight, I might still have managed to get the first 500k sequences. If I used the strategy in #1 above, maybe I could then restart the program and exclude the groups I already have data for.
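The write-as-you-go idea could be sketched roughly as follows. This is a generic checkpointing pattern, not something get-ncbi-data is known to do. `fetch_chunk` is a hypothetical callable that returns the records for one offset, and the output and state file layouts are assumptions.

```python
import json
from pathlib import Path


def download_with_checkpoint(fetch_chunk, total, chunk_size, out_path, state_path):
    """Append chunks to `out_path`, recording progress so a rerun resumes."""
    start = 0
    if state_path.exists():
        # Pick up where the previous (failed) run stopped
        start = json.loads(state_path.read_text())["next_start"]
    while start < total:
        records = fetch_chunk(start, chunk_size)  # may raise on a timeout
        with out_path.open("a") as out:
            out.writelines(record + "\n" for record in records)
        start += chunk_size
        # Only advance the checkpoint after the chunk is safely on disk
        state_path.write_text(json.dumps({"next_start": start}))
    return start
```

If the process dies between writing a chunk and updating the state file, that chunk is fetched and written again on the next run, so a real implementation would also need to remove duplicates.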

I'm worried now that I've run this program unsuccessfully about 3 different times over the past week that I'll get you in trouble with NCBI folks. Do you think it's best for me to wait on any further action until you have a chance to investigate, or should I try downloading again starting this Friday at 9pm EST?

BenKaehler · August 14, 2020, 7:45am

Thanks @devonorourke, sorry for the slow response.

I have pushed a new version (available in my repo or through the PR to @Nicholas_Bokulich's).

The new version is both more stable, thanks to a big refactor and better testing, and much faster, because I have made the downloads run in parallel. I just downloaded and processed > 3.6 million sequences (a new record) in around 1 hour 20 minutes (in Australia). I don't know how much faster that is than it used to be because I never got it to finish before with that many sequences. At least five times faster.

So run it with --p-n-jobs of around five or more. Don't be afraid to rerun queries with NCBI, unless you do it at the wrong time. I have been downloading a lot between 9pm and 5am over the past week, usually the same queries over and over again, and no one has complained.
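For readers wondering what running the downloads in parallel means in practice, here is a generic sketch using a thread pool. How --p-n-jobs is implemented is not shown in this thread, so the use of threads, the `fetch_chunk` callable and the function name are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_chunk, total, chunk_size, n_jobs=5):
    """Fetch every chunk using up to `n_jobs` concurrent workers."""
    starts = range(0, total, chunk_size)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() yields results in the order of `starts`,
        # regardless of which request finishes first
        chunks = pool.map(lambda start: fetch_chunk(start, chunk_size), starts)
        return [record for chunk in chunks for record in chunk]
```

More workers do not lift NCBI's limits: the overall request rate and the time-of-day rules discussed above still apply.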

So I hope that helps. Please let me know how you go. Please run with --verbose --p-logging-level DEBUG and send me the error messages when it crashes!

devonorourke · August 14, 2020, 2:09pm

Great news about the updates - many thanks!

I'll try to gather data tonight. Feel free to gather some arthropod COI records on your end too if you're interested.

Hopefully I'll have some good news to report this weekend.

Cheers

BenKaehler · August 14, 2020, 8:09pm

Thanks @devonorourke, that project grew into something I wasn't expecting but I'm glad we've reached this point.

Please send me the NCBI queries you are trying to download and I will get through as many of them as I can.

devonorourke · August 14, 2020, 8:55pm

There are two related tasks that I'm trying to accomplish with NCBI data. Both involve pulling COI sequences, but splitting the queries on the "BARCODE" keyword. My hope is that this would separate the GenBank sequences that are cross-referenced in the BOLD database from those that are not.

The two commands below were my attempts to do that. However, for the sake of the upcoming paper, I've subsetted the COI references into two taxonomic groups: arthropods and chordates (sorry all you fancy molluscs!).

If it is sufficiently fast for you to download all the data associated with the commands below, that would be great, because it would let me look deeper into the differences among non-arthropod and non-chordate taxa. However, it might be much faster to modify what I've written below to include a taxonomy ID and specify just arthropods (ID 6656), for instance, or just chordates (ID 7711).

If the taxonomic representation of COI is anything like BOLD, the arthropods take the lion's share of the data. Chordates shouldn't take very long, for instance (BOLD has a few 100k unique sequences for chordates, but over 3,000k for arthropods). I have no idea what kind of morass is in GenBank regarding COI, but maybe by downloading all of them at once we'll find out.

Thanks so much Ben! (likewise on a project growing into something you weren't expecting)

gather BOLD-tagged COI data from NCBI:

qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND "BARCODE"[KYWD])' \
--output-dir NCBIdata_BOLDonly

gather nonBOLD-tagged COI data from NCBI:

qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT "BARCODE"[KYWD])' \
--output-dir NCBIdata_notBOLD
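The taxon-restricted variant mentioned above (one clade included, several subclades excluded) could be assembled as in this sketch, which relies on Entrez's `txid<N>[Organism:exp]` search syntax. The helper name is hypothetical and the COI part of the query is shortened for readability. 6656 (arthropods) comes from the post above. The four order-level taxids are quoted from memory and should be checked against NCBI Taxonomy before use.

```python
def restrict_to_taxa(base_query, include, exclude=()):
    """Wrap an Entrez query so it only matches the `include` taxids
    (with all their descendants) and skips the `exclude` taxids."""
    if not include:
        raise ValueError("at least one taxid must be included")
    wanted = " OR ".join(f"txid{taxid}[Organism:exp]" for taxid in include)
    query = f"({base_query}) AND ({wanted})"
    if exclude:
        unwanted = " OR ".join(f"txid{taxid}[Organism:exp]" for taxid in exclude)
        query += f" NOT ({unwanted})"
    return query


# Shortened stand-in for the full COI title query used in the commands above
coi = 'COI[Title] AND "BARCODE"[KYWD]'

# Arthropods (6656) minus Coleoptera, Diptera, Lepidoptera, Hymenoptera
# (order taxids are unverified assumptions; confirm them before running)
other_arthropods = restrict_to_taxa(coi, [6656], [7041, 7147, 7088, 7399])
```

The resulting string would be passed to --p-query. Entrez evaluates Boolean operators left to right, so the trailing NOT clause applies to everything before it.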