Taxonomy Levels from NCBI custom Database

Hi,
I downloaded the 16S RefSeq database from NCBI for taxonomic assignment, along with the corresponding taxonomy, using Entrez from the command line. However, when I checked the taxonomy file, not all records had 7 levels. For instance, any record from the phylum Actinobacteria has only 6 levels: Actinobacteria is both the phylum and the class, so it is not reported twice in the NCBI lineage. Is this a problem for training the classifier or for taxonomic assignment? Should I fill in the missing levels for the incomplete records?
Also, I have seen that the GreenGenes and SILVA taxonomy files have this format:

KF494428.1.1396 D_0__Bacteria;D_1__Epsilonbacteraeota;D_2__Campylobacteria;D_3__Campylobacterales;D_4__Thiovulaceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1

Should my custom taxonomy file look like this, with the underscores and D_0, D_1...?

For reference, this is what my taxonomy looks like:

Hi @ancazugo,
I have good news (or shameless self-promotion, depending on how you see things). We recently released a QIIME 2 plugin, RESCRIPt, that can automate the process of downloading and formatting NCBI GenBank data... using the get-ncbi-data method in RESCRIPt, you can do this much more seamlessly, and will not need to manually format your taxonomy.

Yes, uneven levels will cause problems downstream with various classification methods. You should either reformat to create even levels, or use RESCRIPt, which will do that automatically.
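If you do want to even out the levels by hand, here is a minimal sketch of one common convention: carry the nearest higher-rank name down into any missing rank. It assumes you already have each record's lineage as a rank-to-name mapping (for example, parsed from the ranked lineage that NCBI reports); the `even_lineage` function, the rank labels in `RANKS`, and the `"Unassigned"` placeholder are hypothetical names chosen for illustration, not part of QIIME 2 or RESCRIPt.

```python
# Hypothetical rank labels; adjust them to match the ranks in your own data.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def even_lineage(ranked, ranks=RANKS, sep="; "):
    """Return a lineage string with exactly one entry per rank in `ranks`.

    `ranked` maps a rank label to a taxon name. A missing rank inherits the
    name of the nearest higher rank that is present. If nothing above it is
    present, the placeholder "Unassigned" is used (an assumption).
    """
    filled = []
    last = "Unassigned"
    for rank in ranks:
        name = ranked.get(rank)
        if name:
            last = name
        filled.append(last)
    return sep.join(filled)


# A record with no separate class entry, as in the Actinobacteria case above.
record = {
    "kingdom": "Bacteria",
    "phylum": "Actinobacteria",
    "order": "Micrococcales",
    "family": "Micrococcaceae",
    "genus": "Micrococcus",
    "species": "Micrococcus luteus",
}
print(even_lineage(record))
```

Every record then has the same number of levels, which is the property the classifiers need. Whether to repeat the parent name or leave the level empty is a design choice; repeating it keeps every level non-empty.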

No, that prefix format is specific to the database and is not anything special that QIIME 2 requires. That said, having those prefixes can sometimes aid interpretation (e.g., so it is clear what rank a classification was made to).
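To illustrate how rank prefixes can aid interpretation, here is a small sketch that reads a prefixed lineage like the SILVA-style example in the question and reports the deepest level that actually carries a name. The `deepest_rank` helper is hypothetical, and it assumes each level is written as `prefix__name` with levels separated by semicolons.

```python
def deepest_rank(lineage, sep=";"):
    """Return (prefix, name) for the last non-empty level of a prefixed lineage.

    Each level is expected to look like "D_2__Campylobacteria". Levels without
    a "__" separator or with an empty name are skipped. Returns None if no
    level carries a name.
    """
    result = None
    for level in lineage.split(sep):
        prefix, found, name = level.strip().partition("__")
        if found and name:
            result = (prefix, name)
    return result


example = "D_0__Bacteria;D_1__Epsilonbacteraeota;D_2__Campylobacteria"
print(deepest_rank(example))  # ('D_2', 'Campylobacteria')
```

Without prefixes, the same question can only be answered by counting positions, which is another reason evenly filled levels matter.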

I hope that helps!

That was freaking amazing!!! :sunglasses:
It took like 5 minutes to do what I’ve been trying to do for the past four days with Entrez.
Thank you very much for this!!!
I know you have done an extensive overview of the functions of RESCRIPt regarding the SILVA database, but I hope in the future a little tutorial on the use of NCBI is available; nevertheless, it is pretty straightforward to use.

Thanks again!!! :v:

Thanks @ancazugo! Glad to hear RESCRIPt was useful to you! :smile:

We are planning on releasing this and other tutorials on the forum soon, and I will link here when we do... RESCRIPt is still brand-new and we are working as quickly as we can to fully document it, but I figured our in-progress documentation should not slow you down! :building_construction:

mini-update: I just added a tutorial for the get-ncbi-data method here:

Thanks @ancazugo!
