Filter Sequences by Length BEFORE Downloading with get-ncbi-data

alexkrohn · May 18, 2022, 5:43pm

Hi there,

I've been using and loving rescript for my eDNA metabarcoding projects. I was wondering if it was possible to filter sequences returned from NCBI before downloading them.

For example, I'm doing a metabarcoding project using the 12S gene targeting fishes. One of the target species, Percina brevicauda (txid163812), has its entire mitochondrial genome sequenced but has no individual entries for just the 12S gene. If I were to use this search, I would not get any results:

qiime rescript get-ncbi-data --p-query '(12S[Title] OR 12S rRNA[Title]) AND txid163812[ORGN]' --o-sequences 12s-p-brevicauda-seqs.qza --o-taxonomy 12s-p-brevicauda-taxonomy.qza --p-n-jobs 20

But loosening the query to match 12S anywhere in the entry returns the two complete mitochondrial genomes:

qiime rescript get-ncbi-data --p-query '12S AND txid163812[ORGN]' --o-sequences 12s-p-brevicauda-seqs.qza --o-taxonomy 12s-p-brevicauda-taxonomy.qza --p-n-jobs 20

This strategy works for a single species, but I would like to download 12S sequences from all vertebrates (`--p-query '12S AND txid7742[ORGN]'`) for my metabarcoding work. Running that query yields 195k sequences, including 708 whole genomes or whole chromosomes. That will take a very long time to download and will likely take up a lot of disk space.

Importantly, some of those larger sequences, especially the whole mitochondria, are useful (e.g. P. brevicauda), but many of the whole genomes or whole chromosomes are not. It would be wonderful to only download sequences shorter than, for example, 100kb.

I know I can use qiime rescript filter-seqs-length afterwards, but is there a way to do this before the sequences get downloaded? Or is there a better way to ensure you capture all of the 12S sequences on GenBank, regardless of whether they're in a whole genome, whole mitochondrion, or single 12S sequence?

Thanks for your help!

alexkrohn · May 18, 2022, 5:47pm

As another example, Micropterus dolomieu has a whole genome, a whole mitochondrion, and partial 12S sequences present on GenBank. The only way I can think of to keep the mitochondrion and the partial sequences but NOT the whole genome is to filter by sequence length. Is there a better way?

Nicholas_Bokulich · May 18, 2022, 6:04pm

Hi @alexkrohn ,
Technically, any valid Entrez query should work. According to the docs here, you can use the [SLEN] keyword to select a sequence length range like so:
100:1000[SLEN]

I have never tried this... want to give it a spin and let us know what you find? (Tip: you can run this query on GenBank first to confirm that it eliminates the unwanted sequences before trying to download with RESCRIPt.)
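
One way to do that check from the command line is to ask NCBI's E-utilities `esearch` endpoint for only the hit count. This is a rough sketch, not something RESCRIPt provides: it assumes `curl` is installed, and `parse_count` and `count_hits` are hypothetical helper names made up for illustration.

```shell
# Sketch: count Entrez hits for a query before downloading anything.
# Assumes curl is installed; parse_count and count_hits are hypothetical helpers.

# Extract the first <Count> value from esearch XML read on stdin.
parse_count() {
  grep -o '<Count>[0-9]*</Count>' | head -n 1 | sed 's/[^0-9]//g'
}

# Query the nucleotide database (nuccore) and print the number of matching records.
count_hits() {
  curl -s -G 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' \
    --data-urlencode 'db=nuccore' \
    --data-urlencode 'rettype=count' \
    --data-urlencode "term=$1" | parse_count
}

# Example (requires network access):
# count_hits '12S AND 100:1000[SLEN] AND txid163812[ORGN]'
```

Comparing the count with and without the [SLEN] term should show how many records the length filter drops before any sequences are fetched.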

alexkrohn · May 19, 2022, 4:42pm

Hi @Nicholas_Bokulich!

Thanks for tracking down [SLEN], that's exactly what I was looking for. I confirmed it worked:

# This returns nothing (both mitochondrial genomes are longer than 10 kb)
qiime rescript get-ncbi-data --p-query '12S AND 1:10000[SLEN] AND txid163812[ORGN]' --o-sequences 12s-p-brevicauda-seqs.qza --o-taxonomy 12s-p-brevicauda-taxonomy.qza --p-n-jobs 20

# Widening SLEN to 1:20000 retrieves the two 16 kb mtDNA genomes
qiime rescript get-ncbi-data --p-query '12S AND 1:20000[SLEN] AND txid163812[ORGN]' --o-sequences 12s-p-brevicauda-seqs.qza --o-taxonomy 12s-p-brevicauda-taxonomy.qza --p-n-jobs 20
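
The same pattern should extend to the all-vertebrates search with the 100 kb cap mentioned earlier. Below is a minimal sketch for assembling that query; `build_query` is a hypothetical helper, the output file names are placeholders, and the `qiime` call is left commented out because this larger query has not been tested in this thread.

```shell
# Sketch: assemble a length-capped Entrez query. build_query is a hypothetical helper.
build_query() {
  # $1 = search term, $2 = NCBI taxonomy ID, $3 = maximum sequence length in bp
  printf '%s AND 1:%s[SLEN] AND txid%s[ORGN]' "$1" "$3" "$2"
}

# 12S, all vertebrates (txid7742), nothing longer than 100 kb
QUERY="$(build_query 12S 7742 100000)"
echo "$QUERY"

# Pass the result to RESCRIPt (output file names are placeholders):
# qiime rescript get-ncbi-data --p-query "$QUERY" \
#   --o-sequences 12s-vertebrate-seqs.qza \
#   --o-taxonomy 12s-vertebrate-taxonomy.qza --p-n-jobs 20
```

Keeping the cap in one variable makes it easy to try a few thresholds on GenBank before committing to the full download.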

Much appreciated!