Reference Alignments: Best Way to Download Many Sequences from NCBI GenBank

alexkrohn · June 15, 2021, 8:44pm

Thanks again!

I decided to run two comparisons as we talked about: one using a reference database of all vertebrates + all RefSeq, and one using just the sequences from the ~400 species of interest. As in the paper you sent me, I mainly want to see how many false positives I get with the restricted database.

However, it doesn't seem possible to run a giant query using rescript for 400 taxa. (I tried in the form of:

qiime rescript get-ncbi-data --p-query '("SPECIES1"[Organism] OR "Species2"[Organism] OR "Species3"[Organism] ...... ) AND (16S[Title] OR 16S rRNA[Title])' --o-sequences 16s-400spp-refseqs-unfiltered.qza --o-taxonomy 16s-400spp-refseqs-taxonomy-unfiltered.qza --p-n-jobs 5

Where SPECIES1, 2, 3 etc. are the species' binomials, and the ...... stands for the 397 others.)

QIIME 2 gave an error saying it could not construct a URL from my query, which I assume was because of the query's length.

So that brings me back to my earlier idea: a for loop in bash that runs rescript once per species, then merging the results in QIIME 2. I feel like there should be a better way! Any idea what I'm missing?
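For concreteness, here is a minimal sketch of a middle ground between one giant query and 400 single-species calls: split the species list into batches and build one query per batch. The helper name `build_queries`, the file `species.txt` (one binomial per line), and the batch size are all hypothetical, and I have not checked how many species fit in one query before the URL error comes back.

```shell
#!/bin/sh
# Sketch only: build one Entrez-style query string per batch of species.
# ASSUMPTIONS (not from any QIIME 2 docs): the helper name, the input file
# layout (one binomial per line), and the batch size are illustrative.

# build_queries FILE BATCH_SIZE
# Prints one query per line, each covering up to BATCH_SIZE species.
# The body runs in a subshell, so its variables do not leak to the caller.
build_queries() (
  file=$1
  size=$2
  terms=""
  count=0
  while IFS= read -r species || [ -n "$species" ]; do
    # Skip blank lines in the species list.
    [ -z "$species" ] && continue
    if [ -n "$terms" ]; then
      terms="$terms OR "
    fi
    terms="$terms\"$species\"[Organism]"
    count=$((count + 1))
    # Emit a query once the batch is full, then start a new batch.
    if [ "$count" -eq "$size" ]; then
      printf '(%s) AND (16S[Title] OR 16S rRNA[Title])\n' "$terms"
      terms=""
      count=0
    fi
  done < "$file"
  # Emit the final, partially filled batch if there is one.
  if [ -n "$terms" ]; then
    printf '(%s) AND (16S[Title] OR 16S rRNA[Title])\n' "$terms"
  fi
)

# Hypothetical driver, left commented out because it needs QIIME 2 and
# network access. It reuses the same options as the command in my post.
# i=0
# build_queries species.txt 20 | while IFS= read -r query; do
#   i=$((i + 1))
#   qiime rescript get-ncbi-data \
#     --p-query "$query" \
#     --o-sequences "batch-$i-seqs.qza" \
#     --o-taxonomy "batch-$i-taxonomy.qza" \
#     --p-n-jobs 5 < /dev/null
# done
```

The per-batch outputs would still need to be merged afterwards. I believe `qiime feature-table merge-seqs` and `qiime rescript merge-taxa` are the relevant actions, but that should be checked against the installed plugin versions.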