Create an NCBI classifier with RESCRIPt from manually downloaded FASTA files

Hello,
I would like to use a QIIME 2 classifier based on GenBank/NCBI for my analyses, and I came across the “get-ncbi-data” method of RESCRIPt. Unfortunately, I have not been able to obtain the desired data from NCBI so far, because the download always aborts with
"ConnectionError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Read timed out."
I have already tried various values of “--p-n-jobs”. I then downloaded the sequences manually, but when I try to use the resulting FASTA file I get the error message: "Taxonomy format requires at least one row of data". I understand the problem is that a FASTA file does not contain taxonomy information. Is there a way to use manually downloaded FASTA files to construct a classifier?

I would appreciate a guide.

Many thanks in advance!

Hi @Patrick83,

What time of day are you trying to download the sequences? Connection issues are often less likely overnight, U.S. Eastern Time (9 PM - 5 AM). See the note in our help text:

Please be aware of the NCBI Disclaimer and Copyright notice
(Policies and Disclaimers - NCBI), particularly "run
retrieval scripts on weekends or between 9 pm and 5 am Eastern Time
weekdays for any series of more than 100 requests". As a rough guide, if
you are downloading more than 125,000 sequences, only run this method at
those times.

How many sequences result from your query? Often breaking up the query into smaller chunks, e.g. by taxonomic groups, etc., and then merging the resulting files together works well.
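
As a rough sketch of that chunk-and-merge approach, the functions below fetch one BioProject at a time and then combine the per-chunk artifacts. The helper names (run, fetch_chunk, merge_chunks), the DRY_RUN switch, and the per-chunk file names are made up for illustration; merging with "qiime feature-table merge-seqs" and "merge-taxa" is one option I am assuming here, not something RESCRIPt requires.

```shell
# Sketch only: helper names and file names are hypothetical.
# Set DRY_RUN=1 to print the commands instead of running them.

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Download sequences and taxonomy for a single BioProject ID.
fetch_chunk() {
  run qiime rescript get-ncbi-data \
    --p-query "$1[BioProject]" \
    --o-sequences "seqs-$1.qza" \
    --o-taxonomy "tax-$1.qza"
}

# Merge the per-chunk artifacts into one sequence file and one taxonomy file.
# IDs must not contain spaces: the argument strings are split on purpose.
merge_chunks() {
  seq_args=""
  tax_args=""
  for id in "$@"; do
    seq_args="$seq_args --i-data seqs-$id.qza"
    tax_args="$tax_args --i-data tax-$id.qza"
  done
  run qiime feature-table merge-seqs $seq_args \
    --o-merged-data ncbi-refseqs-unfiltered.qza
  run qiime feature-table merge-taxa $tax_args \
    --o-merged-data ncbi-refseqs-taxonomy-unfiltered.qza
}
```

For the two BioProjects from the tutorial, this would be called as "fetch_chunk 33175", then "fetch_chunk 33317", then "merge_chunks 33175 33317". A chunk that times out can be re-run on its own without repeating the downloads that already succeeded.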

Hi,
thank you for looking into my issue. As you advised, I always tried at 5 a.m. German time (corresponds to 9 p.m. U.S. time). I tried it with the BioProjects from your tutorial "Using RESCRIPt to compile sequence databases and taxonomy classifiers from NCBI Genbank":

qiime rescript get-ncbi-data \
    --p-query '33175[BioProject] OR 33317[BioProject]' \
    --o-sequences ncbi-refseqs-unfiltered.qza \
    --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza

I've tried it a few times, with “--p-n-jobs” values from 2 to 5. Each time it aborts after about 20 minutes with the error message above.

Hi @Patrick83,

I've just run this same command a couple of times without any issues. One of our other forum moderators is in your time zone and also ran it without any problems. Could this be a local connectivity issue?

Hello again,

I tried again earlier, not through our cluster systems this time, and it worked. I assume that the cluster system has some restrictions that prevent multiple connections to the NCBI server and that's why it didn't work, but that's just a guess.

Thank you once again for the help. This is a good forum!

Hi @Patrick83, thank you for letting us know. Glad you got this to work!