I downloaded the 16S RefSeq database from NCBI for taxonomic assignment, along with the corresponding taxonomy, using Entrez from the command line. However, when I checked the taxonomy file, not all records had 7 levels. For instance, any record from the phylum Actinobacteria only has 6 levels: Actinobacteria is both the phylum and the class, so it is not reported twice in the NCBI lineage. Is this a problem for training the classifier or for the taxonomic assignment? Or should I fill in the missing levels for the incomplete records?
Also, I have seen that the GreenGenes and SILVA taxonomy files have this format:
I have good news (or shameless self-promotion, depending on how you see things). We recently released a QIIME 2 plugin, RESCRIPt, that can automate the process of downloading and formatting NCBI GenBank data… using the get-ncbi-data method in RESCRIPt, you can do this much more seamlessly, and you will not need to manually format your taxonomy.
Yes, uneven levels will cause problems downstream with various classification methods. You should either reformat to create even levels, or use RESCRIPt, which will do that automatically.
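If you do want to reformat by hand, here is a minimal Python sketch of one way to even out the levels. It is not how RESCRIPt does it internally; it only illustrates the idea. It assumes you have each record's lineage as a rank-to-name mapping (the `pad_lineage` function, the `RANKS` list, and the example record are all hypothetical), since a plain lineage string alone does not tell you which rank is missing. Missing ranks are filled by repeating the nearest known parent name, which is one possible convention among several.

```python
# Seven target ranks, in order. The rank names used as keys are an assumption;
# adjust them to whatever labels your taxonomy export actually uses.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def pad_lineage(ranked, ranks=RANKS, unknown="Unassigned"):
    """Return a semicolon-delimited lineage with exactly len(ranks) levels.

    ranked: dict mapping rank name -> taxon name; some ranks may be absent.
    A missing rank is filled with the closest known parent name, or with
    `unknown` if no parent has been seen yet.
    """
    levels = []
    last_known = unknown
    for rank in ranks:
        name = ranked.get(rank)
        if name:
            last_known = name
        levels.append(last_known)
    return "; ".join(levels)


# Hypothetical record with no "class" entry, like the Actinobacteria case.
record = {
    "kingdom": "Bacteria",
    "phylum": "Actinobacteria",
    "order": "Micrococcales",
    "family": "Micrococcaceae",
    "genus": "Arthrobacter",
    "species": "Arthrobacter globiformis",
}

padded = pad_lineage(record)
print(padded)
```

Every record then has the same number of levels, with the phylum name standing in for the missing class.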
No, that’s specific to the database and is not anything special that QIIME 2 requires. That said, having those prefixes can sometimes aid interpretation (e.g., so it is clear what rank a classification was made to).
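To show why the prefixes help, here is a small Python sketch that reads the deepest assigned rank from a prefixed taxonomy string. It assumes Greengenes-style single-letter prefixes joined to the name with a double underscore (for example `p__`), with an empty name after the prefix meaning that rank was not assigned; the `deepest_rank` function and the `PREFIX_TO_RANK` mapping are hypothetical names for illustration only.

```python
# Assumed prefix-to-rank mapping; "d" is included because some databases
# label the top level as a domain rather than a kingdom.
PREFIX_TO_RANK = {
    "d": "domain",
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def deepest_rank(taxon_string):
    """Return the name of the last rank that carries a non-empty taxon name."""
    deepest = None
    for level in taxon_string.split(";"):
        prefix, _, name = level.strip().partition("__")
        # Skip levels with an unknown prefix or an empty name (e.g. "c__").
        if name and prefix in PREFIX_TO_RANK:
            deepest = PREFIX_TO_RANK[prefix]
    return deepest


print(deepest_rank("k__Bacteria; p__Actinobacteria; c__; o__"))
```

Without prefixes, the same information has to be inferred from the position of each level, which only works if every record has the same number of levels.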
That was freaking amazing!!!
It took like 5 minutes to do what I’ve been trying to do for the past four days with Entrez.
Thank you very much for this!!!
I know you have done an extensive overview of the functions of RESCRIPt regarding the SILVA database, but I hope in the future a little tutorial on the use of NCBI is available; nevertheless, it is pretty straightforward to use.
Thanks @ancazugo! Glad to hear RESCRIPt was useful to you!
We are planning on releasing this and other tutorials on the forum soon, and I will link here when we do… RESCRIPt is still brand-new and we are working as quickly as we can to fully document it, but I figured our in-progress documentation should not slow you down!