Any method for converting FASTA to taxonomy (txt) format for import to FeatureData[Taxonomy]?

ojholland · May 27, 2019, 8:18am

Hi,

I'm using QIIME2 for a dietary study of a herbivorous marine gastropod, targeting the 23S gene because it amplifies across many macroalgal species. Unfortunately, unlike for the 16S region, there are no databases (that I know of) specific to the gene region I am targeting. I have built a preliminary custom database semi-manually by aligning a selection of Sanger-sequenced 23S PCR products from DNA extracted from local macroalgal species, and manually entering the taxonomic classifications in the format required for import as FeatureData[Taxonomy].
e.g.
Unique ID followed by the classification;
|B12_E02|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Hormosiraceae;g_Hormosira;s_Hormosira_banksii|
|B5_p23SrV_A08|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Durvillaeaceae;g_Durvillaea;s_Durvillaea_potatorum|
|B4_p23SrV_A07|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Seirococcaceae;g_Phyllospora;s_Phyllospora_comosa|
|B13_F02|d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Laminariales;f_Alariaceae;g_Undaria;s_Undaria_pinnatifida|

Currently my database works well and gives the outcomes I expect, but I would like to bolster it to fill the many gaps it is likely to have. I created the current taxonomy txt file in the format above completely by hand (too slow!) and want to be able to add taxonomy entries in this format for many sequences from a list of BLAST results.

Is there a way to convert an aligned FASTA file of multiple sequences from a BLAST search into a complementary txt file in the format above, to be used in creating a custom database?

I'm still pretty green when it comes to using bioinformatic tools so please be patient with my ignorance!
Thanks in advance!

Nicholas_Bokulich · May 27, 2019, 1:04pm

Check out the SILVA database: in addition to the SSU database that is most commonly used, they have an LSU database, which would include 23S. What I do not know is whether it is comprehensive enough for your purposes (e.g., has all the species you expect to see). I'd start there, since building on an existing database is always much easier than building a new one from scratch!

What do the header lines look like in that FASTA? If it is something like:

>sequenceID| d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;o_Fucales;f_Hormosiraceae;g_Hormosira;s_Hormosira_banksii
ACGTGTAGTGTGCTGTAGTAGTCGTGAC

then you can easily do this with some bash commands. Something like this:

grep '>' pathto.fasta | tr -d '>' | tr '|' '\t'

But if the full taxonomy string is not in the header line of the fasta then that will not help.
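
For anyone who prefers Python, here is a minimal sketch of the same idea, assuming the header layout above (ID, then a `|`, then the taxonomy string). The function name `headers_to_taxonomy` is made up for illustration. Unlike the plain `tr` pipeline, it also trims the space that follows the `|` in the example header.

```python
def headers_to_taxonomy(fasta_lines, sep="|"):
    """Yield (sequence_id, taxonomy) pairs from FASTA header lines.

    Assumes each header looks like '>ID| taxonomy;string' (hypothetical
    layout taken from the example header in this thread).
    """
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line[1:].strip()
        if sep not in header:
            continue  # header carries no taxonomy string, nothing to extract
        seq_id, taxonomy = header.split(sep, 1)
        yield seq_id.strip(), taxonomy.strip()


# Example input copied from the header shown above.
example = [
    ">sequenceID| d_Eukaryota;k_Chromista;p_Ochrophyta;c_Phaeophyceae;"
    "o_Fucales;f_Hormosiraceae;g_Hormosira;s_Hormosira_banksii",
    "ACGTGTAGTGTGCTGTAGTAGTCGTGAC",
]

rows = list(headers_to_taxonomy(example))
for seq_id, taxonomy in rows:
    # One tab-separated line per sequence: ID, then taxonomy string.
    print(f"{seq_id}\t{taxonomy}")
```

To run it on a real file, pass an open file handle instead of the `example` list, e.g. `headers_to_taxonomy(open("pathto.fasta"))`.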

ojholland · May 30, 2019, 3:13am

Thanks for your reply @Nicholas_Bokulich .

Nicholas_Bokulich:

then you can easily do this with some bash commands. Something like this:
grep '>' pathto.fasta | tr -d '>' | tr '|' '\t'
But if the full taxonomy string is not in the header line of the fasta then that will not help.

Unfortunately my taxonomy strings are not in the headers of my FASTA sequences, only my unique sequence IDs. I was wishfully thinking that there might be a way to pull the taxonomy strings of multiple individuals (or even one at a time) from the NCBI database using their respective Taxonomy IDs. I suppose this is more of a general question about the capabilities of NCBI BLAST, but I figure there has got to be some method for streamlining the process, as I would hardly think that databases such as SILVA are compiled manually. At the very least I was hopeful that there was a way to generate a taxonomy string from either an accession number or a TaxID.
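
One possible route for the TaxID case, sketched here under assumptions rather than as a tested workflow: NCBI distributes a taxonomy dump ("taxdump") whose `nodes.dmp` file lists each TaxID with its parent and rank, and whose `names.dmp` file lists the names, with fields separated by tab-pipe-tab. Walking from a TaxID up to the root gives a lineage that can be written in the `d_...;k_...;s_...` style used above. The function names, the rank-to-prefix mapping, and the toy TaxIDs below are all made up for illustration, and NCBI's ranks will not necessarily match a hand-built classification (ranks that are absent from the lineage are simply skipped).

```python
# Hypothetical mapping from NCBI rank names to the prefixes used in this thread.
RANK_PREFIX = {
    "superkingdom": "d", "domain": "d", "kingdom": "k", "phylum": "p",
    "class": "c", "order": "o", "family": "f", "genus": "g", "species": "s",
}


def parse_dmp(lines):
    """Split taxdump rows: fields are separated by tab-pipe-tab, rows end in tab-pipe."""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]
        yield line.split("\t|\t")


def load_taxonomy(nodes_lines, names_lines):
    """Build parent, rank and scientific-name lookups keyed by TaxID (as strings)."""
    parent, rank, name = {}, {}, {}
    for fields in parse_dmp(nodes_lines):
        parent[fields[0]] = fields[1]
        rank[fields[0]] = fields[2]
    for fields in parse_dmp(names_lines):
        if fields[3] == "scientific name":
            name[fields[0]] = fields[1]
    return parent, rank, name


def lineage_string(tax_id, parent, rank, name):
    """Walk up the tree from tax_id and join the ranks we recognise."""
    found, seen = {}, set()
    while tax_id in parent and tax_id not in seen:
        seen.add(tax_id)  # the root is its own parent, so guard against looping
        prefix = RANK_PREFIX.get(rank[tax_id])
        if prefix:
            found[prefix] = name.get(tax_id, "").replace(" ", "_")
        tax_id = parent[tax_id]
    return ";".join(f"{p}_{found[p]}" for p in "dkpcofgs" if p in found)


# Toy rows in taxdump layout. The TaxIDs are invented, not real NCBI IDs.
NODES = [
    "1\t|\t1\t|\tno rank\t|", "10\t|\t1\t|\tsuperkingdom\t|",
    "20\t|\t10\t|\tclass\t|", "30\t|\t20\t|\torder\t|",
    "40\t|\t30\t|\tfamily\t|", "50\t|\t40\t|\tgenus\t|",
    "60\t|\t50\t|\tspecies\t|",
]
NAMES = [
    "1\t|\troot\t|\t\t|\tscientific name\t|",
    "10\t|\tEukaryota\t|\t\t|\tscientific name\t|",
    "20\t|\tPhaeophyceae\t|\t\t|\tscientific name\t|",
    "30\t|\tFucales\t|\t\t|\tscientific name\t|",
    "40\t|\tHormosiraceae\t|\t\t|\tscientific name\t|",
    "50\t|\tHormosira\t|\t\t|\tscientific name\t|",
    "60\t|\tHormosira banksii\t|\t\t|\tscientific name\t|",
]

parent, rank, name = load_taxonomy(NODES, NAMES)
print(lineage_string("60", parent, rank, name))
```

With the real dump, the two lists would be replaced by open handles on `nodes.dmp` and `names.dmp`, and the resulting strings paired with the sequence IDs to give the two-column taxonomy file.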

I will give the SILVA database a try, but I suspect that it's not going to be suitable for my sequences: I searched the taxonomy text files (CTRL + F) for some relevant taxa and came up without matches.