Make 12S reference database using RESCRIPt

Junli_Zhang · January 17, 2021, 12:06am

Hi all, I am really struggling to make a reference database for 12S. I am stuck on the first step: downloading sequences from NCBI.

This is the command I ran (after 9 pm): qiime rescript get-ncbi-data --p-query '(txid7742[ORGN] AND 12S [All]) ' --output-dir NCBIdata_12S --p-n-jobs 20

I could not complete the download. When I ran it directly in VirtualBox, I got an out-of-memory error. When I switched to a server, the error messages were:
Plugin error from rescript:

'i' format requires -2147483648 <= number <= 2147483647

Debug info has been saved to /scratch/local/62934027/qiime2-q2cli-err-8jbwlwtg.log
Plugin error from rescript:

Maximum retries (10) exceeded for HTTP request. Persistent trouble downloading from NCBI. Last exception was
ReadTimeout: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Read timed out. (read timeout=10)

Debug info has been saved to /scratch/local/62934027/qiime2-q2cli-err-ibl9azaf.log
Does anybody have experience constructing a 12S reference database? Could you please give me some suggestions?

Thanks!

Junli

Nicholas_Bokulich · January 17, 2021, 12:58pm

Hi @Junli_Zhang,
Welcome to the QIIME 2 forum!

There could be many causes for this, including a firewall, transient server-side issues, or simply a query that is too large.

You could test the first by seeing if a small query (e.g., for a specific sequence) works.
You could test the second by just trying again and hoping it works this time!
But let's assume it's the third issue (query too large).

Your query hits a lot of sequences in GenBank, including many full-length genomes.

You could try to focus your query a bit more, e.g., with a query like this: "txid7742[ORGN] AND 12S [TITLE] NOT mRNA[TITLE]"

That cuts the number of sequences by about half and drops the full-length genomes and predicted mRNA sequences (but maybe you want to keep those? I am making some assumptions), so it should limit the size of the data transfer.
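As a rough sketch, the narrowed query could be dropped into the same get-ncbi-data call used in the original post. The output directory name is hypothetical, and the `command -v` guard is only there so the snippet prints the command instead of failing on a machine where QIIME 2 is not on the PATH.

```shell
# Narrowed query suggested above: title-restricted, predicted mRNAs excluded.
QUERY='txid7742[ORGN] AND 12S[TITLE] NOT mRNA[TITLE]'
OUTDIR='NCBIdata_12S_title'   # hypothetical output directory name

if command -v qiime >/dev/null 2>&1; then
    # Same action and flags as in the original post, minus --p-n-jobs.
    qiime rescript get-ncbi-data --p-query "$QUERY" --output-dir "$OUTDIR"
else
    # QIIME 2 not available: show what would have been run.
    echo "would run: qiime rescript get-ncbi-data --p-query '$QUERY' --output-dir $OUTDIR"
fi
```

Keeping the query in single quotes stops the shell from interpreting the square brackets.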

Another option is to split the query into batches, if you can find a sensible way to divide it, e.g., by downloading subclades separately.
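One way the batching might look is a loop over subclade taxids, with one download per clade. The three clades and taxids below are illustrative assumptions only and do not cover all of Vertebrata, so check each against NCBI Taxonomy and extend the list as needed. `build_query` and the directory names are hypothetical helpers for this sketch.

```shell
# Illustrative "Name:taxid" pairs (an assumption, NOT a complete partition
# of Vertebrata; verify each taxid in NCBI Taxonomy before use).
CLADES='Mammalia:40674 Aves:8782 Actinopterygii:7898'

build_query() {
    # $1 = NCBI taxid; reuses the title filters from the query quoted above.
    printf 'txid%s[ORGN] AND 12S[TITLE] NOT mRNA[TITLE]' "$1"
}

for entry in $CLADES; do
    name=${entry%%:*}     # text before the colon
    taxid=${entry##*:}    # text after the colon
    query=$(build_query "$taxid")
    outdir="NCBIdata_12S_${name}"   # hypothetical per-clade output directory
    if command -v qiime >/dev/null 2>&1; then
        qiime rescript get-ncbi-data --p-query "$query" --output-dir "$outdir"
    else
        echo "would run: qiime rescript get-ncbi-data --p-query '$query' --output-dir $outdir"
    fi
done
```

If one clade still times out, only that batch needs to be retried or split further.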

You can see some related discussion here:

Please give that a try and let us know if you make any progress!

SoilRotifer · January 17, 2021, 4:48pm

Hi @Junli_Zhang,

I've got quite a bit of experience in making 12S rRNA gene databases (e.g., for Eukaryota and Metazoa). Hopefully, we can help.

I agree with @Nicholas_Bokulich: there are things beyond our control when it comes to internet connections. Your query does not appear to return that many sequences, but you can break it up into smaller downloadable chunks within the vertebrates.

Here is a query you can use to download Gnathostomata:

txid7776[ORGN] AND (12S OR 12S ribosomal RNA OR 12S rRNA) AND (mitochondrion[Filter] OR plastid[Filter]) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]

One thing to note: always make sure you have some taxonomic "out groups", or "off-target" taxa, in your reference database. This will better ensure that you do not under- or over-classify your data.

Once you download your data as separate chunks you can merge them using the standard qiime commands:
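A minimal sketch of that merge step, assuming the commands meant are `qiime feature-table merge-seqs` and `qiime feature-table merge-taxa`, and that each chunk was downloaded with `--output-dir`, so that every directory holds a `sequences.qza` and a `taxonomy.qza`. The chunk directory names and merged output names are hypothetical.

```shell
# Hypothetical chunk directories, one per separate get-ncbi-data download.
CHUNKS='NCBIdata_12S_Mammalia NCBIdata_12S_Aves NCBIdata_12S_Actinopterygii'

# Build one "--i-data <file>" pair per chunk for each merge command.
SEQ_ARGS=''
TAX_ARGS=''
for dir in $CHUNKS; do
    SEQ_ARGS="$SEQ_ARGS --i-data $dir/sequences.qza"
    TAX_ARGS="$TAX_ARGS --i-data $dir/taxonomy.qza"
done

if command -v qiime >/dev/null 2>&1; then
    # Left unquoted on purpose so each flag/path pair splits into words
    # (this assumes the directory names contain no spaces).
    qiime feature-table merge-seqs $SEQ_ARGS --o-merged-data merged-seqs.qza
    qiime feature-table merge-taxa $TAX_ARGS --o-merged-data merged-taxonomy.qza
else
    echo "would run: qiime feature-table merge-seqs$SEQ_ARGS --o-merged-data merged-seqs.qza"
    echo "would run: qiime feature-table merge-taxa$TAX_ARGS --o-merged-data merged-taxonomy.qza"
fi
```

The two merged artifacts can then be used as the sequence and taxonomy inputs for the later steps.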

Then you can proceed with RESCRIPt.

Keep us posted!

Junli_Zhang · January 17, 2021, 10:36pm

Thank you @SoilRotifer and @Nicholas_Bokulich, I will try splitting my query into chunks and will post an update once I have tried it.
