Taxonomic analysis of ITS1

I'm trying to classify my ITS region 1 (ITS1) amplicons using ITSoneDB (http://itsonedb.ba.itb.cnr.it/), because my samples include non-fungal eukaryotic organisms and the UNITE database is fungal only.

However, ITSoneDB only provides a FASTA file containing all representative sequences (see below for an example); there is no taxonomy file, which would be needed to train a QIIME classifier. I wonder if I could still train a classifier, or if there is some other way to use this representative-sequences FASTA file to classify my samples.

Thanks!


ITSoneDB representative sequences:

EU070644_ITS1_ENA|Eryngium fernandezianum|477861|ITS1 located by ENA annotation, 221bp
tcgatgcctgcaaagcagaacgacccgcgaacacgtcaaaaataacgggcgagcggtccggggggcgcaagctccacgcgtccgcgaacccgcaggtcgagggcgtccctgggcgctcgacggccgcaaactcaccccggcgcggaatgcgccaaggaaatagaaccggactgaacgttctcgcccccgttcgcgggtggcgatggcgtctttcagaaaca
FJ565263_ITS1_ENA|Cuitlauzina egertonii|587970|ITS1 located by ENA annotation, 216bp
tcgagaccgaaaaatataccgagcgattcggacaacccgtgaaatgagggaatggccgtcccggtcgtcgcccccgactccccttcgggaggagggggcacggcggaggatggatgaaccacaaaccggcgcagcatcgcgccaagggaatattgagatgcacgagccccgcgtcgggctcggtggcgtggagtgctgttgcacgccatgcggatg
KP325085_ITS1_HMM|Heterorhabditis sp. WS1|1659303|ITS1 located by HMM annotation, 401bp
cgtcgatgccttataggtatatgctttgatcacgagatgctgataatcatggaatcaagcttgctcttgatttcagtcggtgtctcaccccatctaagctctcggagaggtgtctattcttgattggagccgatttgagtgacggcaatgataattggatatgctcccgttcggataagagcataagacttaatgagctgatctaggtctgtcgcctcaccaaaaacccatcgatagttggtggctaagtgatgagactttgtcaaaatcactaatctgctatgcggggagccttaatgagttgttcgtgtcacttggccgagacaaccgccagtatcgataaatctcttcccaattaacttgtttctagtaaaggctattgagttagtggaacattagcc

Hey @xpeng!

I just took a look at that database, and the annotations are a little disappointing. It looks like you are stuck with what’s in the header.

Now because some of these are ENA sequences, you could look up the accession. For example:

EU070644 has a nice taxonomy associated with it. I don’t have any experience querying ENA, but maybe others have a way to automate this.

For the HMM inferred rep-seqs, I don’t have a good answer, because it’s not clear to me where those (original?) IDs come from.


As far as QIIME 2 is concerned, there isn’t a way out of the taxonomy file, and the FASTA headers here aren’t actually enough to go on either (probably…).

Following up on that, do you need a full taxonomy, or are the accession numbers sufficient for your purposes? If the accession numbers are enough, it’s a pretty simple script to parse the headers and grab the first section of the |-delimited values. If you were to create a “taxonomy file” which mapped the full ID to that, QIIME 2 would be satisfied, but you’d only have one taxonomic level (which is usually not the goal).
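A minimal sketch of that header-parsing idea in Python, under a few assumptions that are not confirmed in this thread: that the real ITSoneDB file uses standard `>` header lines, that the output should be a tab-separated file with `Feature ID` and `Taxon` columns, and that it is worth shortening each header to just the accession (the headers contain spaces, and FASTA readers commonly treat only the text before the first whitespace as the sequence ID). The function names are made up for illustration.

```python
import csv
import io


def build_mock_taxonomy(fasta_lines):
    """Return (rewritten FASTA lines, taxonomy rows) for ITSoneDB-style input.

    Each header is cut down to its first |-delimited field (the accession),
    and that same accession is used as the single-level "taxon".
    """
    fasta_out = []
    rows = []
    for line in fasta_lines:
        if line.startswith(">"):
            accession = line[1:].strip().split("|")[0]
            fasta_out.append(">" + accession + "\n")
            rows.append((accession, accession))
        else:
            # Sequence lines pass through unchanged.
            fasta_out.append(line)
    return fasta_out, rows


def write_taxonomy(rows, handle):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Feature ID", "Taxon"])
    writer.writerows(rows)


if __name__ == "__main__":
    example = [
        ">EU070644_ITS1_ENA|Eryngium fernandezianum|477861|ITS1 located by ENA annotation, 221bp\n",
        "tcgatgcctgcaaagcag\n",
    ]
    new_fasta, taxonomy_rows = build_mock_taxonomy(example)
    buffer = io.StringIO()
    write_taxonomy(taxonomy_rows, buffer)
    print("".join(new_fasta), end="")
    print(buffer.getvalue(), end="")
```

For a real run you would pass an open file handle for the ITSoneDB FASTA instead of the `example` list, and write both outputs to disk before importing them into QIIME 2.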

Sorry I don’t have better news; hopefully others have some experience with ENA querying.

Hey Evan thanks for your prompt response!

I also thought of parsing the headers to generate a mock taxonomy file in order to train a classifier, but I was wondering whether there is a native approach to dealing with such a file. You answered my question. I’m writing to the people who maintain the database to find out how the HMM IDs were generated.

I will look around for automation methods for querying ENA with genus/species names.
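As a starting point for that kind of lookup, here is a rough Python sketch that collects the unique species names from the headers, together with the numeric field that follows each name. It assumes `>`-prefixed headers with at least three |-delimited fields, as in the examples above; what the numeric field actually is (possibly a taxonomy identifier) is a guess on my part and not confirmed by ITSoneDB. The function name is hypothetical, and the ENA query itself is not shown.

```python
def species_lookup_table(fasta_lines):
    """Map each unique species name in the headers to its numeric ID field."""
    table = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].strip().split("|")
        if len(fields) < 3:
            # Header does not follow the expected layout; skip it.
            continue
        name = fields[1].strip()
        numeric_id = fields[2].strip()
        # Keep the first ID seen for a given name.
        table.setdefault(name, numeric_id)
    return table


def genus_of(species_name):
    """Return the first word of a binomial name, assumed to be the genus."""
    return species_name.split()[0]
```

The resulting dictionary gives one entry per species, so each name only needs to be looked up once, however many sequences share it.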

Just a follow-up for those who are interested in using ITSoneDB for their analysis.

I was informed by the developers of ITSoneDB that they are working on generating a taxonomy file for their ITS1 sequence database. Stay tuned for its release.
