Make own trained database for ITS(fungi)

MathiasLR · March 21, 2019, 10:56am

Greetings,

I want to compare 4 different ITS databases on my sequence datasets. To do so, I have downloaded several ready-to-use databases, but a few of them do not have quite the same format. To get data that are usable and comparable, I first need to know what the right format is for the files to be trained smoothly ;-). For training, I understood that I need 1 sequence file (.fasta) and 1 file with the corresponding taxonomy (.tax).
Regarding those 2 files here are my first questions:

For the sequence file:
1- Does each ID line need to be simple, like ">ID1002", or can it carry additional information, like ">ID1002 :Acremonium acutatum|n437: ITS sequence"? Even worse, can the ID be in the middle of the line, like ">1|Bio|1|1147229000000043|ID1002|Acremonium acutatum|n437: ITS sequence"?
2- Do the sequences need to be on a single line, rather than in the multi-line FASTA format with 60-nucleotide lines?
For the taxonomy file:
1- I understood that the format should be "ID tab k__Fungi;p__..." with the ID separated from the rest by a tabulation and within the taxonomy itself a separation with ";".
2- Depending on the source of the taxonomy, at the species level I sometimes have "s__GenusName_SpeciesName", "sp__GenusName_SpeciesName", "sp__SpeciesName" or even "s__SpeciesName". I can make them all alike, that is not a problem, but my question is which one to choose. Is there anywhere in QIIME 2 that needs only the species name, or both genus and species, or "sp" or "s" as the specific tag?
Thanks a lot for your help. I sincerely hope this is not a redundant question, because I searched the forum for an answer and didn't find matching information.
Cheers!
Mat

Nicholas_Bokulich · March 21, 2019, 12:39pm

Hi @MathiasLR,
There are many different tools that can use taxonomy data/fasta files in QIIME 2, and some of these (since they are sometimes wrapping external tools) can be more restrictive about special characters and formats than even QIIME 2! So in general, the "cleaner" you can make your data the better.

In theory anything can be on this ID line, but some special characters (like the "|" characters separating some IDs in your examples) could cause issues. I recommend removing special characters and whitespace just to be sure (or feel free to trim after whitespace, since things like taxonomy names are unnecessary). That whole line will then be considered your ID, and should be used in the taxonomy file.
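A minimal Python sketch of that cleanup, in case it helps. The function name `clean_fasta_id` and its `pipe_field` parameter are hypothetical (not part of QIIME 2), and the set of "safe" characters is an assumption you may want to adjust:

```python
import re

def clean_fasta_id(header, pipe_field=None):
    """Return a simplified ID from a FASTA header line.

    pipe_field is a hypothetical helper option: if given, the header is
    split on '|' and only that field (0-based) is kept before cleaning.
    """
    # Drop the leading '>' and surrounding whitespace.
    text = header.lstrip(">").strip()
    if pipe_field is not None:
        text = text.split("|")[pipe_field]
    # Keep only what comes before the first whitespace.
    tokens = text.split()
    token = tokens[0] if tokens else ""
    # Assumption: letters, digits, '_', '.' and '-' are safe; replace the rest.
    return re.sub(r"[^A-Za-z0-9_.-]", "_", token)

print(clean_fasta_id(">ID1002 :Acremonium acutatum|n437: ITS sequence"))
print(clean_fasta_id(
    ">1|Bio|1|1147229000000043|ID1002|Acremonium acutatum|n437: ITS sequence",
    pipe_field=4,
))
```

Both calls should give `ID1002`, which is then the ID to use in the taxonomy file.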

It looks like multifasta format is okay for QIIME 2, but I would personally convert to single line format because that conversion should be easy and could avoid issues with tools wrapped by QIIME 2 (see earlier warning).
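The conversion can be sketched in a few lines of plain Python. `unwrap_fasta` is a hypothetical helper name, and this assumes a well-formed FASTA file where every record starts with a ">" line:

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    # Emit the last record.
    if header is not None:
        yield header, "".join(chunks)

wrapped = [">ID1002", "ACGT", "TTGA", ">ID1003", "GGCC"]
for header, seq in unwrap_fasta(wrapped):
    print(header)
    print(seq)
```

To run it on a real file, pass an open file handle instead of the list and write each header and sequence to a new file.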

QIIME 2 does not care one bit which species label format you use. For your own sanity, though, I recommend two things:

be consistent in whatever you use
make sure each taxon has the same number of taxonomic ranks/levels
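One quick way to check the second point is to count how many ranks each taxonomy string has. This is only a sketch: `rank_count_summary` is a hypothetical helper, and it assumes the "ID, tab, ranks separated by ';'" layout described in the question:

```python
from collections import Counter

def rank_count_summary(tax_lines):
    """Count how many taxonomy strings have each number of ranks."""
    counts = Counter()
    for line in tax_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumed layout: ID <tab> rank1;rank2;...
        seq_id, taxonomy = line.split("\t", 1)
        ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
        counts[len(ranks)] += 1
    return counts

example = [
    "ID1\tk__Fungi;p__Ascomycota",
    "ID2\tk__Fungi;p__Ascomycota;c__Sordariomycetes",
]
print(rank_count_summary(example))
```

If the summary has more than one key, the file mixes taxonomy strings of different depths.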

Good luck! And please consider sharing your formatted reference databases after you are done with them — other QIIME 2 users on this forum may find them useful

MathiasLR · March 21, 2019, 2:59pm

Thanks a lot Nicholas for this quick and clear answer!
It helps a lot!
An additional query, when you say:

make sure each taxon has the same number of taxonomic ranks/levels

You mean within one taxonomy file? I guess that having the same taxonomic levels across all the DBs would be good as well.

Thanks again! I might be back once my DBs are ready, and yes, I'll see how to share them so people don't have to do that again!

Nicholas_Bokulich · March 21, 2019, 3:07pm

Yes, within a single file the taxonomic levels should be uniform depending on which classification method(s) you plan to use (this is required by some classifiers, and is not a requirement of the format itself).

But indeed, making this uniform between databases will probably make it easier for you to compare results.

MathiasLR · March 21, 2019, 3:13pm

Perfect, thanks a bunch!

MathiasLR · March 21, 2019, 4:28pm

Hi,
Everything is progressing fine!
I actually have another question: do the taxonomy and sequence files need to be sorted the same way to use them for training?

colinbrislawn · March 21, 2019, 4:30pm

Hello,

When you say 'sorted', what do you mean? Do you mean that the order of the reads in the seqs file and order of the taxonomy strings in the taxonomy file are the same?

Colin

Nicholas_Bokulich · March 21, 2019, 4:32pm

No, the seqs and taxonomies do not need to have IDs in the same order.
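If you want a sanity check anyway, comparing the two sets of IDs ignores order entirely. This is a sketch under my own assumptions: `compare_ids` is a hypothetical helper, the FASTA ID is taken as the text before the first whitespace on each ">" line, and the taxonomy file is tab-separated with the ID first:

```python
def compare_ids(fasta_lines, tax_lines):
    """Return (IDs only in the FASTA, IDs only in the taxonomy file)."""
    seq_ids = {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    }
    tax_ids = {line.split("\t", 1)[0] for line in tax_lines if line.strip()}
    # Set differences do not depend on the order of either file.
    return seq_ids - tax_ids, tax_ids - seq_ids

fasta = [">ID002", "ACGT", ">ID001", "GGCC"]
tax = ["ID001\tk__Fungi", "ID002\tk__Fungi", "ID003\tk__Fungi"]
missing_tax, missing_seq = compare_ids(fasta, tax)
print(missing_tax, missing_seq)
```

Here the FASTA and taxonomy lists are in different orders, and the only thing reported is that `ID003` has a taxonomy entry but no sequence.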

MathiasLR · March 21, 2019, 4:34pm

Hello Colin,

Well yes, if my tax file has its IDs sorted as ID001, ID002, ID003, etc., do I need to have my sequence file begin with the sequence of ID001, then ID002, etc., or does it not matter?
thanks

thermokarst · March 21, 2019, 4:44pm

Please see @Nicholas_Bokulich's answer above.

MathiasLR · April 18, 2019, 1:07pm

Hello!
The databases are done and work just fine. I'll see how to share them with the community in an easy way. If you have any suggestions, I'll be happy to hear them!

Thanks

colinbrislawn · April 18, 2019, 9:51pm

I've used the Open Science Framework ( osf.io ) to share scientific data before. It's free and highly recommended!

Colin
