Using RESCRIPt with STIRRUPS

Hello

I am trying to follow [Using RESCRIPt to compile sequence databases and taxonomy classifiers from NCBI Genbank] for STIRRUPS. There are indeed 973 accessions (some of which have no "id" or "GenBank_Accession_Number"). I converted the download to .txt and named it "stirrups-accessions.txt" accordingly.

When running
qiime rescript get-ncbi-data \
  --p-query '33175[BioProject] OR 33317[BioProject]' \
  --m-accession-ids-file stirrups-accessions.txt \
  --o-sequences ncbi-refseqs-unfiltered.qza \
  --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza
(understandably), I got the following error:
There was an issue with loading the file stirrups-accessions.txt as metadata:

There was an issue with loading the metadata file:

Metadata IDs must be unique. The following IDs are duplicated: '-'

So I deleted the rows with missing "id" and "GenBank_Accession_Number" and reran the above command. This time, I got the following error:
Plugin error from rescript:

Partial download. Expected 939 records, but got 938.
More than 10 ids were missing. Ten were: 219857437, 125487083, 265678780, 265679029, 343200178, 285162865, 307816494, 301072779, 116054477, 10862897.

How can I solve this issue? What is the correct way to handle the rows with missing "id" and "GenBank_Accession_Number"?

Thank you.

Best Regards
Stephanie

Hi @Stephanie,

There are a few issues here.

  1. Note that not all of the items listed in the supplementary spreadsheet provided by Fettweis et al. 2012 contain valid IDs.

    So, I sorted the file so that all of the accession IDs marked with "-" were at the top. Then I copied the remaining items, which have valid IDs, to a text file and tried fetching the data from GenBank.

    I ran into an issue similar to yours:

    Partial download. Expected 938 records, but got 937.
    The following ids were missing: AF222894, CU928158.1.

  2. I inspected these accessions on GenBank and found some errors in the spreadsheet. These records are old and may have been curated since the original upload. Specifically, AF222894 should be AF222894.1, and CU928158.1 should be CU928158. Here is the file:
    stirrups.txt (10.0 KB)

Once I made the above corrections, all the data downloaded just fine. You should be able to follow the remainder of the instructions outlined in the tutorial or perform any other curation steps you require.
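
If you would rather script the cleanup than sort and copy by hand, the same steps could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: that the spreadsheet was exported as a tab-separated file, that the accession column is named "GenBank_Accession_Number", and that missing entries are marked with "-". The function names, the CORRECTIONS table, and the output file name are hypothetical and only illustrate the idea; the two corrections are the ones described above.

```python
import csv

# Hypothetical correction table: only the two accessions discussed in this
# thread. Extend it if GenBank reports other missing ids.
CORRECTIONS = {
    "AF222894": "AF222894.1",
    "CU928158.1": "CU928158",
}


def clean_accessions(lines, id_column="GenBank_Accession_Number", placeholder="-"):
    """Return unique, corrected accession ids from tab-separated lines.

    Assumes the first line is a header containing `id_column` and that
    missing accessions are marked with `placeholder` (both are assumptions
    about how the spreadsheet was exported).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    seen = set()
    cleaned = []
    for row in reader:
        acc = (row.get(id_column) or "").strip()
        # Skip rows without a usable accession.
        if not acc or acc == placeholder:
            continue
        # Apply the known fixes, then drop duplicates so that the
        # metadata ids stay unique.
        acc = CORRECTIONS.get(acc, acc)
        if acc in seen:
            continue
        seen.add(acc)
        cleaned.append(acc)
    return cleaned


def write_accession_file(accessions, path):
    """Write a one-column metadata file with an 'id' header."""
    with open(path, "w", newline="") as fh:
        fh.write("id\n")
        for acc in accessions:
            fh.write(acc + "\n")


# Example usage (file names are placeholders):
# with open("stirrups-accessions.txt", newline="") as fh:
#     ids = clean_accessions(fh)
# write_accession_file(ids, "stirrups-accessions-clean.txt")
```

The cleaned file could then be passed to --m-accession-ids-file in place of the original one.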

-Cheers!
-Mike


Hi Mike

Your suggestions solved my problem. Thanks!!

Best Regards
Stephanie
