RESCRIPt p-query issue from NCBI: Plugin error from rescript: taxonomy format requires at least one row of data

bkramer · August 20, 2021, 8:47pm

Hello,

I am trying to create a nifH database that's compatible with QIIME 2 using RESCRIPt. I have installed the plugin in QIIME 2 version 2021.4. The ID number of the BioProject from NCBI that I'm trying to import into QIIME 2 is 418634 (ID 418634 - BioProject - NCBI). However, when I use the following code:

qiime rescript get-ncbi-data \
--p-query '418634[BioProject]' \
--p-n-jobs 5 \
--o-sequences /home/qiime2/Desktop/PostImported_Ben/ncbi-refseqs-unfiltered.qza \
--o-taxonomy /home/qiime2/DesktopPostImported_Ben/ncbi-refseqs-taxonomy-unfiltered.qza

I receive the following plugin error from rescript:

"Taxonomy format requires at least one row of data"

When I swap out the BioProject ID with that of the RESCRIPt tutorial ('33175'), the data imports fine...

Is this a formatting issue with the BioProject I queried? If so, is there a way to work around this?

Any help would be greatly appreciated!!

Nicholas_Bokulich · August 21, 2021, 6:25am

Hi @bkramer ,

It looks like the issue is that you are trying to download raw reads from a research study, stored in SRA. There are evidently no taxonomic annotations associated with these, whereas the BioProject in the tutorial links to annotated reference sequences in the nucleotide archive (more specifically, it is a RefSeq Targeted Loci project with curated annotations).

So the short explanation is that there are sequences to download in project 418634, but there are no associated taxonomy annotations, so the action fails. This is also basically what the error message is saying: it could not retrieve the expected annotations.

This action is specifically meant for getting annotated sequences. It is not equipped to handle sample metadata, so downloading study data is not an intended use (and any attempt to do so will fail).

We are working on such an action in QIIME 2, which should be released in the coming months, that will allow downloading study data.

RESCRIPt could still be used to build your nifH database, but you should use a keyword query (unless you find an appropriate BioProject query, e.g., if there is a nifH RefSeq Targeted Loci project). Something like nifH[title] NOT uncultured[title] (note: I have not tested that query!)
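
As a rough sketch, the keyword query suggested above could be passed to the same get-ncbi-data action used in the first post. The output file names, the run helper, and the DRY_RUN switch are hypothetical additions for illustration; by default the snippet only prints the command so it can be inspected before anything is downloaded.

```shell
# Sketch only: file names and the DRY_RUN switch are hypothetical.
QUERY='nifH[title] NOT uncultured[title]'   # the untested query suggested above

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Same action and options as in the original post, with a keyword query
# in place of the BioProject ID.
fetch_nifh() {
  run qiime rescript get-ncbi-data \
    --p-query "$QUERY" \
    --p-n-jobs 5 \
    --o-sequences ncbi-nifH-seqs-unfiltered.qza \
    --o-taxonomy ncbi-nifH-tax-unfiltered.qza
}

fetch_nifh
```

Running it with DRY_RUN=0 in an environment where QIIME 2 and RESCRIPt are installed would perform the actual download. Note that the dry-run output drops the quoting around the query, so it is meant for reading rather than for copy and paste.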

Give that a try and let us know what you find... I recommend testing out a query directly on GenBank so that you can refine it further (e.g., to see a summary of how many sequences are retrieved, and the different taxonomic groups that are retrieved). Then download with RESCRIPt once you have found a query that you like!
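
Besides the GenBank web interface, one possible way to check how many records a query matches is NCBI's E-utilities esearch endpoint, which returns only the hit count when rettype=count is set. The sketch below assumes curl and network access are available; the function names parse_count and ncbi_count are made up for this example, and the count is only an approximation of what RESCRIPt would retrieve.

```shell
# Sketch: count nucleotide records matching an Entrez query via esearch.
EUTILS='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Pull the number out of the <Count> element of an esearch XML reply.
parse_count() {
  sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}

# ncbi_count QUERY -> prints the number of matching nuccore records.
# curl -G sends the url-encoded fields as a GET query string, so the
# spaces and brackets in the query do not need manual escaping.
ncbi_count() {
  curl -sG "$EUTILS" \
    --data-urlencode 'db=nuccore' \
    --data-urlencode 'rettype=count' \
    --data-urlencode "term=$1" | parse_count
}
```

For example, ncbi_count 'nifH[title] NOT uncultured[title]' prints a single number. Comparing it against the count for a broader query gives an early warning that a download may be too large for the available memory.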

Good luck!

bkramer · August 21, 2021, 1:51pm

Thank you Nicholas! I will give that a try.

bkramer · August 23, 2021, 10:04pm

Good afternoon,

So I tried to import data into QIIME 2, though not surprisingly I think the data set was massive: 30 GB of RAM and 4 CPUs were insufficient to complete the command.

Specifically, I used the query options:

'nifH[All Fields] AND (biomol_genomic[PROP] AND refseq[filter] AND is_nuccore[filter])'

Nicholas_Bokulich · August 24, 2021, 5:21am

How many lines are in the file, and what is the file size? (In your terminal, use the command wc -l insert-filepath-name-here.)

nifH[All Fields] is probably also going to grab any genome that has a nifH gene.

bkramer · August 24, 2021, 1:47pm

Hi @Nicholas_Bokulich !

So I actually just tried your original suggestion (nifH[title] NOT uncultured[title]) and it worked! I now have unfiltered sequence data and a taxonomy file from NCBI in QIIME 2. I should've just done what you suggested before.

I did have a question though, as this "database" is different from the one you would get from Silva, mainly in that I'm not sure what the filtering settings should be for nifH prior to building and testing my classifier, although I'm sure that's something I can figure out with a literature search.

Thank you so much for all your help!!!

Nicholas_Bokulich · August 24, 2021, 3:22pm

Glad you got it working!

Some of the same filtering should be sensible... e.g., same general quality criteria. Length filtering criteria you can pull from the literature and/or look for clear outliers (possibly after trimming by primer).
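
To look for clear length outliers, one option is to tabulate sequence lengths from a FASTA file. This sketch assumes the sequences have first been exported to FASTA (for instance with qiime tools export); the function names are hypothetical, and any cutoffs should still come from the nifH literature rather than from this table alone.

```shell
# fasta_lengths FILE -> one sequence length per line.
# Handles wrapped FASTA by summing lines until the next header.
fasta_lengths() {
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1"
}

# length_table FILE -> "count length" pairs, shortest length first.
length_table() {
  fasta_lengths "$1" | sort -n | uniq -c
}
```

Calling length_table on the exported FASTA file and scanning the first and last few rows is usually enough to spot sequences that are far shorter or longer than the bulk, before and after trimming by primer.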

I cannot advise more specifically than that... I have not worked with nifH before!

Good luck!

bkramer · August 24, 2021, 3:55pm

Thank you! Currently extracting sequences based on the IGK3/DVV primer set...hopefully it works!

Beyond that, I was planning on dereplicating my sequence and taxonomy files, as was necessary in the SILVA tutorial for RESCRIPt. Before I do, I wanted to check whether the following options would/should work with NCBI-imported data:

--p-rank-handles 'ncbi'
--p-mode 'uniq'
--i-p-perc-identity 99
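
For reference, a hedged sketch of how those options might fit into a full dereplicate call. QIIME 2 command-line options are prefixed with --i- for inputs, --p- for parameters, and --o- for outputs, so --i-p-perc-identity would presumably be --p-perc-identity; as far as I know it takes a proportion (0.99) rather than a percentage (99). Whether 'ncbi' is an accepted value for --p-rank-handles in this release is not confirmed in this thread, so the sketch leaves that option out; qiime rescript dereplicate --help lists the valid choices. File names and the DRY_RUN switch are hypothetical.

```shell
# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# derep_nifh SEQS.qza TAXONOMY.qza
# Output names are placeholders; adjust to taste.
derep_nifh() {
  run qiime rescript dereplicate \
    --i-sequences "$1" \
    --i-taxa "$2" \
    --p-mode 'uniq' \
    --p-perc-identity 0.99 \
    --o-dereplicated-sequences nifH-seqs-derep-uniq.qza \
    --o-dereplicated-taxa nifH-tax-derep-uniq.qza
}

derep_nifh nifH-seqs-extracted.qza ncbi-nifH-tax-unfiltered.qza
```

Checking the printed command against the --help output before running with DRY_RUN=0 should catch any option that this RESCRIPt version does not accept.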

system · September 24, 2021, 9:56pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.