I have been having an issue importing this file mytaxon.txt (5.9 MB) as a reference taxonomy. The file was initially used to train an RDP classifier, so I'm unsure whether it has the right layout to be imported as-is into QIIME... I was having issues with the reference sequences fasta but managed to change the case of the bases and get it imported as a .qza in the end.
Indeed, those are not TAB separated values, those are * STAR * separated values!
While you could replace the stars with tabs, I'm not sure that would give you the format you need for QIIME 2. See how closely your data matches the examples in the Training feature classifiers tutorial.
Once we have a goal, we can take a look at how best to get your file to match, or choose a different method.
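For the star-to-tab replacement mentioned above, here is a minimal Python sketch. It assumes the only thing wrong is the delimiter, which is probably not the whole story for an RDP training file. The function names and the sample row in the comment are made up for illustration and are not taken from mytaxon.txt.

```python
def stars_to_tabs(line: str) -> str:
    """Replace every '*' delimiter in one line with a tab.

    Assumption: '*' never appears inside a field value.
    """
    return line.rstrip("\n").replace("*", "\t")


def convert_file(src_path: str, dst_path: str) -> None:
    """Write a tab-separated copy of a star-separated file, skipping blank lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(stars_to_tabs(line) + "\n")


# Hypothetical usage (the output file name is an assumption):
# convert_file("mytaxon.txt", "mytaxon.tsv")
# A made-up row such as "1*Bacteria*0*1*domain" would become
# five tab-separated fields.
```

Even after this, the columns would still need to be compared against the tutorial's example taxonomy file before trying an import.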
It looks like they’re in completely different formats. I would have no idea how to go about getting the former into a similar format to the latter. I don’t suppose there is any way of importing an RDP classifier into QIIME2 is there?
Also, maybe should have mentioned but completely forgot to, the mytaxon file and reference sequences include sequences from GenBank and BOLD. Will I be able to train a classifier with the references coming from two different sources?
Thanks for turning me onto that! Will see if I can wrap my brain around that and work it out that way.
In the meantime though I found this. It looks like the OP might have been looking to do something kind of similar to what I wanted to do?
Following their protocol, I get the same error as before when trying to use the initial headers.txt file that I've made look like the example taxonomy file... Is there a reason this file can't be imported as a taxonomy? Have I completely misunderstood the OP's plan, and it isn't applicable at all?
I think your issue thread is much more helpful than the one I posted. The main goal is still to get a 7-level taxonomy file that matches the requirements for dada2 and qiime2.
I think you are off to a good start! For others on the forums, here are the first few lines of that file:
I think we are getting closer! It looks like each line in this file starts with > instead of going directly to the read ID. It definitely has more than 7 taxonomy levels... and might even have an inconsistent number of taxonomy levels.
I think with some sed and maybe cut, you can make these files match! Let me know what you try!
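Before reaching for sed and cut, it may help to check how many taxonomy levels each line really has. Here is a rough Python sketch for that. It assumes each line looks like `>ID`, then whitespace, then semicolon-separated levels. That layout is a guess, since the file contents are not shown in this thread, and `split_header` and `level_counts` are hypothetical helper names.

```python
from collections import Counter


def split_header(line, level_sep=";"):
    """Return (read_id, [levels]) for one header line.

    Assumed layout (not confirmed by the thread): '>ID level1;level2;...'
    """
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]  # drop the leading '>' so the line starts with the ID
    parts = text.split(None, 1)  # split on the first run of whitespace only
    read_id = parts[0] if parts else ""
    taxonomy = parts[1] if len(parts) > 1 else ""
    levels = [lvl.strip() for lvl in taxonomy.split(level_sep) if lvl.strip()]
    return read_id, levels


def level_counts(lines):
    """Count how many lines have each number of taxonomy levels."""
    return Counter(len(split_header(line)[1]) for line in lines if line.strip())


# Hypothetical usage:
# with open("headers.txt", encoding="utf-8") as handle:
#     print(level_counts(handle))
```

If the counter comes back with more than one key, the number of levels is inconsistent and the lines will need trimming or padding before they can all match a 7-level layout.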
@colinbrislawn, the link you posted is about the structure of the file format. The link @JedL provided demonstrates a potential script for parsing and reformatting that file. With the first link, a user would be armed with the general knowledge to implement the script at the second link.
Thanks for the guidance, I was able to get it imported as a Taxonomy just now!
Hopefully it will be useful to me!
The final bits and bobs were just removing the first two layers of classification (e.g. cellularOrganisms;Bacteria) and the '>' before the ID, and editing a couple of the IDs so they weren't recognised as matching (I changed them in the ref seqs as well; I feel like I shouldn't have done that, but hey ho). I'll see if it works now....
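For anyone landing here later, the clean-up described above could be sketched in Python roughly like this. It is not the exact procedure used in this thread. It assumes input lines of the form `>ID`, whitespace, then semicolon-separated levels, and it writes a two-column, tab-separated row (ID, then taxonomy string). The function names, the `drop_leading` parameter, and the sample IDs are all hypothetical.

```python
from collections import Counter


def to_taxonomy_row(line, drop_leading=2, level_sep=";"):
    """Turn '>ID level1;level2;...' into 'ID<TAB>level3;...'.

    drop_leading=2 removes the first two levels
    (e.g. cellularOrganisms;Bacteria), as described in the post above.
    """
    text = line.strip().lstrip(">")  # remove the '>' before the ID
    parts = text.split(None, 1)  # ID, then the rest of the line
    if len(parts) != 2:
        raise ValueError(f"no taxonomy found on line: {line!r}")
    read_id, taxonomy = parts
    levels = [lvl.strip() for lvl in taxonomy.split(level_sep) if lvl.strip()]
    kept = levels[drop_leading:]
    return read_id + "\t" + level_sep.join(kept)


def duplicate_ids(rows):
    """List IDs that occur more than once in tab-separated rows."""
    counts = Counter(row.split("\t", 1)[0] for row in rows)
    return sorted(read_id for read_id, n in counts.items() if n > 1)
```

Running `duplicate_ids` on the converted rows is one way to spot IDs that clash before importing. Any ID renamed in the taxonomy file has to be renamed identically in the reference sequences, otherwise the two files will no longer line up.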