Any method for converting FASTA to taxonomy (txt) format for import to FeatureData[Taxonomy]?


I’m using QIIME2 for a dietary study of a herbivorous marine gastropod using the 23S gene due to its broad applicability across many macroalgal species. Unfortunately, unlike the 16S region, there are no databases (that I know of) that are specific to the gene region I am targeting. I have built a preliminary custom database semi-manually by aligning a selection of Sanger-sequenced 23S PCR products from DNA extracted from local macroalgal species and manually entering the taxonomic classifications in the format required for import as FeatureData[Taxonomy].
Unique ID followed by the classification:
|B12_E02|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Hormosiraceae;g_Hormosira;s_Hormosira_banksii|
|B5_p23SrV_A08|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Durvillaeaceae;g_Durvillaea;s_Durvillaea_potatorum|
|B4_p23SrV_A07|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Seirococcaceae;g_Phyllospora;s_Phyllospora_comosa|
|B13_F02|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Laminariales;f_Alariaceae;g_Undaria;s_Undaria_pinnatifida|

Currently my database works well and gives the outcomes I expect, but I would like to expand it to fill the many gaps it is likely to have. I created the current taxonomy txt file in the format above completely by hand (too slow!) and want to be able to add many more sequences in this format from a list of BLAST results.

Is there a way to convert an aligned FASTA file of multiple sequences from a BLAST search into a matching txt file in the format above, to be used in creating a custom database?

I’m still pretty green when it comes to using bioinformatic tools so please be patient with my ignorance!
Thanks in advance!

Check out the SILVA database. In addition to the SSU database that is most commonly used, they have an LSU database, which would include 23S. What I do not know is whether it is comprehensive enough for your purposes (e.g., has all the species you expect to see). I’d start there, since building on an existing database is always much easier than building a new one from scratch!

What do the header lines look like in that FASTA? If it is something like:

>sequenceID| d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Hormosiraceae;g_Hormosira;s_Hormosira_banksii

then you can easily do this with some bash commands. Something like this:

grep '>' pathto.fasta | tr -d '>' | tr '|' '\t'

But if the full taxonomy string is not in the header line of the FASTA, then that will not help.
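For anyone who prefers a script over a one-liner, here is a rough Python sketch of the same idea. It assumes the headers really do look like the example above (sequence ID, then a `|`, then the full taxonomy string); the function names `headers_to_taxonomy` and `write_taxonomy` are made up for illustration. Unlike the `tr` version, it also trims the space after the `|` and skips headers that carry no taxonomy.

```python
def headers_to_taxonomy(fasta_lines):
    """Yield (sequence_id, taxonomy) pairs from FASTA header lines.

    Assumes headers of the form '>sequenceID| d_...;k_...;...;s_...'.
    """
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        seq_id, sep, taxonomy = header.partition("|")
        if not sep or not taxonomy.strip():
            continue  # header has no taxonomy string, nothing to export
        yield seq_id.strip(), taxonomy.strip()


def write_taxonomy(fasta_path, out_path):
    """Write a tab-separated ID/taxonomy file from a FASTA file."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for seq_id, taxonomy in headers_to_taxonomy(fasta):
            out.write(f"{seq_id}\t{taxonomy}\n")
```

Calling `write_taxonomy("pathto.fasta", "taxonomy.txt")` would then produce one tab-separated line per annotated header.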

Thanks for your reply @Nicholas_Bokulich.

Unfortunately my taxonomy strings are not in the headers of my FASTA sequences, only my unique sequence IDs. I was wishfully thinking that there might be a way to pull the taxonomy strings for multiple records (or even one at a time) out of the NCBI database using their respective Taxonomy IDs. I suppose this is more of a general question about the capabilities of NCBI BLAST, but I figure there has got to be some method for streamlining the process, as I would hardly think that databases such as SILVA are compiled manually. At the very least I was hopeful that there was a way to generate a taxonomy string from either an accession number or a TaxID.

I will give the SILVA database a try, but I suspect that it’s not going to be suitable for my sequences: a quick CTRL + F search of the taxonomy text files for some relevant taxa came up without matches.
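
On the TaxID-to-taxonomy-string idea, here is a minimal Python sketch of the general technique, not a tested pipeline. It assumes you already have three lookup tables keyed by TaxID (parent TaxID, rank, and scientific name), for example ones you have parsed yourself from a taxonomy dump. The function name `lineage_string`, the rank labels in `RANK_PREFIXES`, and the toy TaxIDs below are all invented for illustration, so check the rank names against whatever taxonomy source you actually use. Ranks that are missing from a lineage are simply skipped, so the output may have fewer levels than a hand-built string.

```python
# Rank name -> prefix used in the taxonomy strings shown earlier in the thread.
# The rank labels on the left are an assumption about the taxonomy source.
RANK_PREFIXES = {
    "superkingdom": "d", "kingdom": "k", "phylum": "p", "class": "c",
    "order": "o", "family": "f", "genus": "g", "species": "s",
}


def lineage_string(taxid, parents, ranks, names):
    """Walk from taxid up to the root and build 'd_...;...;s_...'."""
    found = {}
    seen = set()
    while taxid in parents and taxid not in seen:
        seen.add(taxid)  # guards against a root that is its own parent
        prefix = RANK_PREFIXES.get(ranks.get(taxid))
        if prefix:
            found[prefix] = names[taxid].replace(" ", "_")
        taxid = parents[taxid]
    # Emit ranks in fixed order, dropping any that were not found.
    return ";".join(
        f"{p}_{found[p]}" for p in RANK_PREFIXES.values() if p in found
    )


# Toy tables with made-up TaxIDs, only to show the shape of the inputs.
parents = {1: 1, 10: 1, 20: 10, 30: 20}
ranks = {1: "no rank", 10: "superkingdom", 20: "genus", 30: "species"}
names = {1: "root", 10: "Eukaryota", 20: "Hormosira", 30: "Hormosira banksii"}

print(lineage_string(30, parents, ranks, names))
```

With the toy tables this prints `d_Eukaryota;g_Hormosira;s_Hormosira_banksii`. Pairing each sequence ID with the TaxID of its BLAST hit and running it through a function like this would give one taxonomy line per sequence.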