Hi there,
I've been using and loving rescript for my eDNA metabarcoding projects. I was wondering if it was possible to filter sequences returned from NCBI before downloading them.
For example, I'm doing a metabarcoding project using the 12S gene targeting fishes. One of the target species, Percina brevicauda (txid163812), has its entire mitochondrial genome sequenced but does not have individual entries for just the 12S gene. If I were to use this search, I would not get any results:
qiime rescript get-ncbi-data --p-query '(12S[Title] OR 12S rRNA[Title]) AND txid163812[ORGN]' --o-sequences 12s-p-brevicauda-seqs.qza --o-taxonomy 12s-p-brevicauda-taxonomy.qza --p-n-jobs 20
But loosening the query to match 12S anywhere in the entry yields the two complete mitochondrial genomes:
qiime rescript get-ncbi-data --p-query '12S AND txid163812[ORGN]' --o-sequences 12s-p-brevicauda-seqs.qza --o-taxonomy 12s-p-brevicauda-taxonomy.qza --p-n-jobs 20
This strategy works for a single species, but I would like to download 12S sequences from all vertebrates (`--p-query '12S AND txid7742[ORGN]'`) for my metabarcoding work. Running that query yields 195k sequences, including 708 whole genomes or whole chromosomes. That will take a very long time to download and will likely take up a lot of disk space.
Importantly, some of those larger sequences, especially the whole mitochondria, are useful (e.g. P. brevicauda), but many of the whole genomes or whole chromosomes are not. It would be wonderful to only download sequences shorter than, for example, 100kb.
I know I can use qiime rescript filter-seqs-length afterwards, but is there a way to do this before the sequences get downloaded? Or is there a better way to ensure you capture all of the 12S sequences on GenBank, regardless of whether they're in a whole genome, whole mitochondrion, or single 12S sequence?
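To make the length cutoff concrete, here is a rough sketch of the kind of query I have in mind. NCBI's Nucleotide search has a Sequence Length field (`[SLEN]`) that accepts ranges, so a term like `1:100000[SLEN]` could in principle be appended to the search. The helper name `with_max_length` and the output file names are made up for illustration, and I am only assuming that get-ncbi-data passes the query string through to Entrez unchanged; I have not confirmed that.

```shell
# Hypothetical helper: wrap an Entrez search term and append a
# sequence-length range using the NCBI Nucleotide [SLEN] field.
# $1 = original search term, $2 = maximum sequence length in bp.
with_max_length() {
  printf '(%s) AND 1:%s[SLEN]' "$1" "$2"
}

# All vertebrate entries mentioning 12S, capped at 100 kb.
QUERY="$(with_max_length '12S AND txid7742[ORGN]' 100000)"
echo "$QUERY"

# Assumption: the string is forwarded to Entrez as-is by --p-query, e.g.:
# qiime rescript get-ncbi-data --p-query "$QUERY" \
#   --o-sequences 12s-vertebrata-seqs.qza \
#   --o-taxonomy 12s-vertebrata-taxonomy.qza \
#   --p-n-jobs 20
```

If that works, whole mitochondrial genomes (roughly 16-17 kb in fishes) would still be kept, while whole chromosomes and nuclear genome assemblies would be excluded at search time.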
Thanks for your help!