I’m trying to classify my ITS region 1 amplicons using ITSoneDB (http://itsonedb.ba.itb.cnr.it/), because my samples include non-fungal eukaryotic organisms and the UNITE database is fungal only.
However, ITSoneDB only provides a FASTA file containing all representative sequences (see below for an example), and there is no taxonomy file, which would be needed to train a QIIME classifier. I wonder if I could still train a classifier, or if there is some other way to use this representative-sequences FASTA file to classify my samples.
ITSoneDB representative sequences:
EU070644_ITS1_ENA|Eryngium fernandezianum|477861|ITS1 located by ENA annotation, 221bp
FJ565263_ITS1_ENA|Cuitlauzina egertonii|587970|ITS1 located by ENA annotation, 216bp
KP325085_ITS1_HMM|Heterorhabditis sp. WS1|1659303|ITS1 located by HMM annotation, 401bp
I just took a look at that database, and the annotations are a little disappointing. It looks like you are stuck with what’s in the header.
Now because some of these are ENA sequences, you could look up the accession. For example:
EU070644 has a nice taxonomy associated with it. I don’t have any experience querying ENA, but maybe others have a way to automate this.
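One possible way to automate that lookup, sketched in Python. This is not a tested workflow, and it rests on two assumptions that are not confirmed anywhere in this thread: that the third |-delimited field of each header (e.g. 477861) is an NCBI taxon ID, and that the ENA taxonomy REST service answers at the URL below with JSON containing `lineage` and `scientificName` keys. The function names are made up for illustration. Please check both assumptions against the ENA documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: endpoint of the ENA taxonomy REST service; verify before use.
ENA_TAXON_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxid}"


def taxid_from_header(header):
    """Return the third |-delimited field, assumed to be an NCBI taxon ID."""
    fields = header.lstrip(">").split("|")
    return fields[2].strip()


def fetch_taxon_record(taxid, timeout=30):
    """Download one taxon record as a dict (performs a network request)."""
    url = ENA_TAXON_URL.format(taxid=urllib.parse.quote(str(taxid)))
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def record_to_taxon(record):
    """Join the lineage and the scientific name into one ';'-separated string.

    ASSUMPTION: the record has 'lineage' (a ';'-separated string) and
    'scientificName' keys; missing keys are treated as empty.
    """
    lineage = [name.strip() for name in record.get("lineage", "").split(";")
               if name.strip()]
    name = record.get("scientificName", "").strip()
    if name and (not lineage or lineage[-1] != name):
        lineage.append(name)
    return "; ".join(lineage)


# Usage (commented out because it needs network access):
# header = "EU070644_ITS1_ENA|Eryngium fernandezianum|477861|ITS1 located by ENA annotation, 221bp"
# print(record_to_taxon(fetch_taxon_record(taxid_from_header(header))))
```

Note that the lineage returned this way would be unranked (no k__/p__ style prefixes), so it may still need cleanup before it is useful as a taxonomy file.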
For the HMM inferred rep-seqs, I don’t have a good answer, because it’s not clear to me where those (original?) IDs come from.
As far as QIIME 2 is concerned, there isn’t a way out of the taxonomy file, and the FASTA headers here aren’t actually enough to go on either (probably…).
Following up on that, do you need a full taxonomy, or are the accession numbers sufficient for your purposes? If the accession numbers are enough, it’s a pretty simple script to parse the headers and grab the first section of the |-delimited values. If you were to create a “taxonomy file” which mapped the full ID to that, QIIME 2 would be satisfied, but you’d only have 1 taxonomic level (which is usually not the goal).
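A minimal sketch of that kind of header-parsing script, in Python. It assumes every header follows the three examples above (ID, organism name, a numeric field, free-text description, separated by |), and it is a variation on the idea rather than exactly what is described: it uses the first field as the feature ID and the organism name as the single taxonomic level. The function names are just for illustration, and the FASTA headers would probably need to be rewritten to the same short IDs so the sequence file and the taxonomy file agree.

```python
import csv
import io


def parse_header(header):
    """Split an ITSoneDB-style header into (sequence_id, organism, number)."""
    fields = header.lstrip(">").strip().split("|")
    if len(fields) < 3:
        raise ValueError(f"unexpected header layout: {header!r}")
    return fields[0], fields[1], fields[2]


def taxonomy_tsv(headers):
    """Build a two-column, tab-separated taxonomy table as a string."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Feature ID", "Taxon"])  # header row of a QIIME 2 taxonomy TSV
    for header in headers:
        seq_id, organism, _number = parse_header(header)
        writer.writerow([seq_id, organism])  # only one "level": the organism name
    return out.getvalue()


example_headers = [
    "EU070644_ITS1_ENA|Eryngium fernandezianum|477861|ITS1 located by ENA annotation, 221bp",
    "FJ565263_ITS1_ENA|Cuitlauzina egertonii|587970|ITS1 located by ENA annotation, 216bp",
    "KP325085_ITS1_HMM|Heterorhabditis sp. WS1|1659303|ITS1 located by HMM annotation, 401bp",
]

print(taxonomy_tsv(example_headers))
```

In practice you would read the header lines (the ones starting with ">") from the FASTA file instead of the hard-coded list, and write the result to a .tsv file.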
Sorry I don’t have better news, hopefully others have some experience with ENA querying.
Hey Evan thanks for your prompt response!
I also thought of parsing the headers to generate a mock taxonomy file in order to train a classifier, but I was also wondering if there exists a native approach to dealing with such a file. You answered my question. I’m writing to the people who maintain the database to find out how the HMM IDs were generated.
I will look around for automation methods for querying ENA with genus/species names.
Just a followup for those who are interested in using ITSoneDB for their analysis.
I was informed by the developers of ITSoneDB that they are working on generating a taxonomy file for their ITS1 sequence database. Stay tuned for its release.