Making my own trained database for ITS (fungi)

Greetings,

I want to compare 4 different ITS databases on my sequence datasets. To do so, I have downloaded several ready-to-use databases, but a few of them are not in exactly the same format. To get data that are usable and comparable, I first need to know what the right file format is for training to run smoothly ;-). As I understand it, training requires 1 sequence file (.fasta) and 1 file with the corresponding taxonomy (.tax).
Regarding those 2 files, here are my first questions:

  • For the sequence file:
    1- Does each ID line need to be simple, like “>ID1002”, or can it carry additional information, like “>ID1002 :Acremonium acutatum|n437: ITS sequence”? Even worse, can the ID be in the middle of the line, like “>1|Bio|1|1147229000000043|ID1002|Acremonium acutatum|n437: ITS sequence”?
    2- Do the sequences need to be on a single line, rather than in the multi-line FASTA format with 60-nucleotide lines?
  • For the taxonomy file:
    1- I understood that the format should be “ID tab k__Fungi;p__…”, with the ID separated from the rest by a tab and the ranks within the taxonomy itself separated by “;”.
    2- Depending on the source of the taxonomy, the species level sometimes appears as “s__GenusName_SpeciesName”, “sp__GenusName_SpeciesName”, “sp__SpeciesName” or even “s__SpeciesName”. Making them all alike is not a problem, but which one should I choose? Is there anywhere in QIIME 2 that requires only the species name, or both genus and species, or “sp” or “s” as a specific tag?
    Thanks a lot for your help. I sincerely hope this is not a redundant question; I searched the forum for an answer and didn’t find matching information.
    Cheers!
    Mat

Hi @MathiasLR,
There are many different tools that can use taxonomy data/fasta files in QIIME 2, and some of these (since they are sometimes wrapping external tools) can be more restrictive about special characters and formats than even QIIME 2! So in general, the "cleaner" you can make your data the better.

In theory anything can be on the ID line of the sequence file, but some special characters (like the "|" characters separating some IDs in your examples) could cause issues. I recommend removing special characters and whitespace just to be sure (or feel free to trim everything after the first whitespace, since things like taxonomy names are unnecessary there). Whatever remains on that line will then be considered your ID, and the same ID should be used in the taxonomy file.
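As a rough illustration of that cleanup, here is a minimal Python sketch. The function name `clean_header`, its `field` and `sep` parameters, and the set of characters kept are all assumptions made for this example, not anything QIIME 2 requires.

```python
import re

def clean_header(header, field=None, sep="|"):
    """Return a simplified ID from a FASTA header line (hypothetical helper)."""
    text = header.lstrip(">").strip()
    if field is not None:
        # Pick one separator-delimited field (0-based), for headers
        # where the ID sits in the middle of the line.
        text = text.split(sep)[field]
    else:
        # Keep only what precedes the first whitespace.
        text = text.split()[0]
    # Assumption: replace anything other than letters, digits, '_', '.', '-'.
    return re.sub(r"[^A-Za-z0-9_.-]", "_", text.strip())


print(clean_header(">ID1002 :Acremonium acutatum|n437: ITS sequence"))
print(clean_header(
    ">1|Bio|1|1147229000000043|ID1002|Acremonium acutatum|n437: ITS sequence",
    field=4,
))
```

Both calls reduce the example headers from the question to `ID1002`; in the second, `field=4` selects the fifth "|"-separated field.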

It looks like multifasta format is okay for QIIME 2, but I would personally convert to single line format because that conversion should be easy and could avoid issues with tools wrapped by QIIME 2 (see earlier warning).
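The conversion to single-line format could look something like the sketch below. It uses plain Python only; the function name `unwrap_fasta` and the file names in the usage note are placeholders chosen for this example.

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    # Emit the final record, if any.
    if header is not None:
        yield header, "".join(chunks)


example = [">ID1", "ACGT", "TTGA", "", ">ID2", "GGCC"]
for header, seq in unwrap_fasta(example):
    print(header)
    print(seq)
```

To convert a real file, one option is to open the input (say `wrapped.fasta`, a hypothetical name) and pass the file object to `unwrap_fasta`, writing each header and sequence on its own line to a new file.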

As for which species-level label to use ("s__" or "sp__", with or without the genus name): QIIME 2 does not care one bit. For your own sanity, though, I recommend two things:

  1. be consistent in whatever you use
  2. make sure each taxon has the same number of taxonomic ranks/levels
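One way to check the second point is to count how many ranks each taxonomy string has. The sketch below assumes the tab-separated "ID, then semicolon-delimited taxonomy" layout described in the question; the function name `rank_counts` and the example lines are made up for illustration.

```python
from collections import Counter

def rank_counts(tax_lines):
    """Count how many taxonomy entries have each number of ranks."""
    counts = Counter()
    for line in tax_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        seq_id, taxonomy = line.split("\t", 1)
        # Ignore empty pieces such as the one left by a trailing ';'.
        ranks = [r for r in taxonomy.split(";") if r.strip()]
        counts[len(ranks)] += 1
    return counts


example = [
    "ID1\tk__Fungi;p__Ascomycota;c__Sordariomycetes",
    "ID2\tk__Fungi;p__Ascomycota",
    "ID3\tk__Fungi;p__Basidiomycota;c__Agaricomycetes",
]
print(rank_counts(example))
```

If the result has more than one key, some entries have a different number of ranks than others and may need padding or trimming.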

Good luck! And please consider sharing your formatted reference databases after you are done with them — other QIIME 2 users on this forum may find them useful :wink:

Thanks a lot Nicholas for this quick and clear answer!
It helps a lot!
An additional query. When you say:

“make sure each taxon has the same number of taxonomic ranks/levels”

you mean within one taxonomy file? I guess that having the same taxonomic levels across all the DBs would be good as well.

Thanks again! I might be back once my DBs are ready, and yes, I'll see how to share them so people don't have to do that again! :slight_smile:

Yes, within a single file the taxonomic levels should be uniform depending on which classification method(s) you plan to use (this is required by some classifiers, and is not a requirement of the format itself).

But indeed, making this uniform between databases will probably make it easier for you to compare results.
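The species-label variants raised in the original question are one place where such harmonization can be scripted. Here is a small sketch that rewrites an "sp__" prefix as "s__". The choice of "s__" is arbitrary (the opposite convention would work just as well), and the function name `normalize_species_prefix` is made up for this example.

```python
import re

def normalize_species_prefix(taxonomy):
    """Rewrite a leading 'sp__' on any rank as 's__'.

    Also strips whitespace around each ';'-separated rank.
    """
    ranks = [r.strip() for r in taxonomy.split(";")]
    ranks = [re.sub(r"^sp__", "s__", r) for r in ranks]
    return ";".join(ranks)


print(normalize_species_prefix("k__Fungi;g__Acremonium;sp__Acremonium_acutatum"))
```

Whether the species rank carries the genus name as well is a separate decision; this sketch leaves the name itself untouched.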

Perfect, thanks a bunch!
:grinning:!

Hi,
Everything is progressing fine!
I actually have another question: do the taxonomy and sequence files need to be sorted the same way to be used for training?

Hello,

When you say ‘sorted’, what do you mean? Do you mean that the order of the reads in the seqs file and order of the taxonomy strings in the taxonomy file are the same?

Colin

No, the seqs and taxonomies do not need to have IDs in the same order.
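Order aside, it can still be worth confirming that both files contain the same set of IDs. A minimal sketch, assuming the FASTA ID is the first whitespace-delimited token of each header and the taxonomy file is tab-separated (the function name `compare_ids` is hypothetical):

```python
def compare_ids(fasta_lines, tax_lines):
    """Return (IDs only in the FASTA, IDs only in the taxonomy file)."""
    fasta_ids = {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }
    tax_ids = {
        line.split("\t", 1)[0].strip()
        for line in tax_lines
        if line.strip()
    }
    # Sets ignore order, so differently sorted files compare equal.
    return fasta_ids - tax_ids, tax_ids - fasta_ids


fasta = [">ID002", "ACGT", ">ID001", "GGCC", ">ID003", "TTAA"]
tax = ["ID001\tk__Fungi", "ID002\tk__Fungi", "ID004\tk__Fungi"]
only_fasta, only_tax = compare_ids(fasta, tax)
print(only_fasta, only_tax)
```

Two empty sets mean every sequence has a taxonomy entry and vice versa, whatever the order of the records.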

Hello Colin,

Well, yes: if my tax file has its IDs sorted as ID001, ID002, ID003, etc., does my sequence file need to begin with the sequence of ID001, then ID002, etc., or does it not matter?
Thanks

Please see @Nicholas_Bokulich’s answer above.

Hello!
The databases are done and work just fine. I'll see how to share them with the community in an easy way. If you have any suggestions, I'll be happy to hear them :slight_smile:!

Thanks

I’ve used the Open Science Framework (osf.io) to share scientific data before. It’s free and highly recommended!

Colin
