I want to compare 4 different ITS databases on my sequence datasets. To do so, I have downloaded several ready-to-use databases, but a few of them are not in exactly the same format. To make the data usable and comparable, I first need to know the right format for the files so that training goes smoothly ;-). As I understand it, training requires 1 sequence file (.fasta) and 1 file with the corresponding taxonomy (.tax).
Regarding those 2 files, here are my first questions:
For the sequence file:
1- Does each ID line need to be simple, like “>ID1002”, or can it carry additional information, like “>ID1002 :Acremonium acutatum|n437: ITS sequence”? Even worse, can the ID be in the middle of the line, like “>1|Bio|1|1147229000000043|ID1002|Acremonium acutatum|n437: ITS sequence”?
2- Do the sequences need to be on a single line, rather than in the multi-FASTA format with 60-nucleotide lines?
For the taxonomy file:
1- I understood that the format should be “ID tab k__Fungi;p__…”, with the ID separated from the rest by a tab and the ranks within the taxonomy separated by “;”.
2- Depending on the source of the taxonomy, the species level is sometimes written “s__GenusName_SpeciesName”, “sp__GenusName_SpeciesName”, “sp__SpeciesName” or even “s__SpeciesName”. Making them all alike is not a problem, but which one should I choose? Is there anywhere in QIIME 2 that requires only the species name, or both genus and species, or “sp” or “s” as the specific tag?
Thanks a lot for your help. I sincerely hope this is not a redundant question; I searched the forum for an answer and didn’t find matching information.
Cheers!
Mat
Hi @MathiasLR,
There are many different tools that can use taxonomy data/fasta files in QIIME 2, and some of these (since they are sometimes wrapping external tools) can be more restrictive about special characters and formats than even QIIME 2! So in general, the “cleaner” you can make your data the better.
In theory anything can be on this ID line, but some special characters (like the “|” characters separating some IDs in your examples) could cause issues. I recommend removing special characters and whitespace just to be sure (or feel free to trim after whitespace, since things like taxonomy names are unnecessary). That whole line will then be considered your ID, and should be used in the taxonomy file.
It looks like multifasta format is okay for QIIME 2, but I would personally convert to single line format because that conversion should be easy and could avoid issues with tools wrapped by QIIME 2 (see earlier warning).
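As a rough illustration of the two suggestions above (trimming the ID line and converting to single-line sequences), here is a minimal Python sketch. The function names `clean_header` and `unwrap_fasta` are hypothetical, and the choice to replace every character outside letters, digits, `_`, `.` and `-` with an underscore is an assumption, not a QIIME 2 rule. Whatever ID comes out must be the same ID used in the taxonomy file.

```python
import re

def clean_header(header):
    """Keep the first whitespace-delimited token and replace special characters.

    The allowed character set is an assumption made for this sketch.
    """
    parts = header.lstrip(">").split()
    token = parts[0] if parts else ""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", token)

def unwrap_fasta(lines):
    """Yield (clean_id, sequence) pairs with each sequence joined onto one line."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = clean_header(line), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Toy record built from the header in the question; the sequence is made up.
example = [">ID1002 :Acremonium acutatum|n437: ITS sequence", "ACGT", "TTGA"]
records = list(unwrap_fasta(example))
for seq_id, seq in records:
    print(f">{seq_id}\n{seq}")
```

Note that for a header with the ID in the middle, this sketch keeps the whole first token (pipes become underscores) rather than extracting “ID1002”, so such headers would need their own parsing rule.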
QIIME 2 does not care one bit which species label you choose. For your own sanity, though, I recommend two things:
be consistent in whatever you use
make sure each taxon has the same number of taxonomic ranks/levels
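One way to check the second point is to count how many “;”-separated ranks each line of the taxonomy file has. This is only a sketch: the function name `rank_counts` is hypothetical, it assumes the “ID tab taxonomy” layout described in the question, and the example lineages are illustrative.

```python
from collections import Counter

def rank_counts(tax_lines):
    """Count how many taxa have each number of ';'-separated ranks."""
    counts = Counter()
    for line in tax_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Everything after the first tab is treated as the taxonomy string.
        seq_id, _, taxonomy = line.partition("\t")
        ranks = [r for r in taxonomy.split(";") if r.strip()]
        counts[len(ranks)] += 1
    return counts

# Illustrative lines only; real files would be read with open(...).
example = [
    "ID1002\tk__Fungi;p__Ascomycota;s__Acremonium_acutatum",
    "ID1003\tk__Fungi;p__Ascomycota",
]
print(rank_counts(example))
```

If the result has more than one key, the file mixes taxa with different numbers of ranks.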
Good luck! And please consider sharing your formatted reference databases after you are done with them — other QIIME 2 users on this forum may find them useful
Yes, within a single file the taxonomic levels should be uniform depending on which classification method(s) you plan to use (this is required by some classifiers, and is not a requirement of the format itself).
But indeed, making this uniform between databases will probably make it easier for you to compare results.
Hi,
Everything is progressing fine!
I actually have another question: do the taxonomy and sequence files need to be sorted the same way to be used for training?
When you say ‘sorted’, what do you mean? Do you mean that the order of the reads in the seqs file and order of the taxonomy strings in the taxonomy file are the same?
Well, yes. If my tax file has its IDs sorted as ID001, ID002, ID003, etc., does my sequence file need to begin with the sequence of ID001, then ID002, etc., or does it not matter?
Thanks
Hello!
The databases are done and work just fine. I'll see how to share them with the community in an easy way. If you have any suggestions, I'll be happy to hear them!