How to Import or Convert this file to the appropriate format to use as a reference taxonomy

JedL · February 4, 2020, 2:56am

Hello,

I have been having an issue importing this file mytaxon.txt (5.9 MB) as a reference taxonomy. The file was initially used to train an RDP classifier so I'm unsure of whether it has the right layout to be imported as is to QIIME... I was having issues with the reference sequences fasta but managed to change the case of the bases and get it uploaded as a .qza in the end.

I have got the error:

"There was a problem importing mytaxon.txt:

mytaxon.txt is not a(n) TSVTaxonomyFormat file"

From the code:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-path mytaxon.txt \
  --output-path co1_ref_taxonomy_TP.qza

And the error code:

There was a problem importing mytaxon.txt:

mytaxon.txt is not a(n) HeaderlessTSVTaxonomyFormat file

From the code:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path mytaxon.txt \
  --output-path co1_ref_taxonomy_TP.qza

I have tried converting the .txt file to a .biom file but am getting errors saying that "mytaxon.txt is not a BIOM file!"

If anyone could shed some light and give me a hand I'd be most pleased!

colinbrislawn · February 4, 2020, 2:04pm

Good morning Jed,

I can help you get started, but I'm not sure we will find an easy answer.

I took a look at the first few lines of the file:

%> head mytaxon.txt 
1*cellularOrganisms*0*0*cellularOrganisms
2*Archaea*1*1*superkingdom
3*undef_Archaea*2*2*kingdom
4*Euryarchaeota*3*3*phylum
5*Halobacteria*4*4*class
6*Halobacteriales*5*5*order
7*Haloarculaceae*6*6*family
8*Halapricum*7*7*genus
9*Halapricum_salinum*8*8*species
10*Halobacteriaceae*6*6*family

Indeed, those are not TAB separated values, those are * STAR * separated values!

While you could replace the stars with tabs, I'm not sure that would put the file in the format you need for QIIME 2. See how closely your data matches the examples in the Training feature classifiers tutorial.

Once we have a goal, we can take a look at how best to get your file to match, or choose a different method.
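
For readers wondering why swapping stars for tabs is not enough: the lines above look like RDP's training taxonomy layout, which is assumed here to be taxid*name*parent taxid*depth*rank (the column meanings are not stated in this thread). Under that assumption, a rough awk sketch can follow the parent IDs to rebuild one semicolon-separated lineage per taxon. The name rdp_lineages is made up for illustration, and the output is keyed by taxon ID, not by sequence ID, so on its own it is still not a QIIME 2 taxonomy file.

```shell
# Sketch only: assumes columns are taxid*name*parent_taxid*depth*rank.
# rdp_lineages is a hypothetical helper, not part of RDP or QIIME 2.
rdp_lineages() {
  awk -F'[*]' '
    { name[$1] = $2; parent[$1] = $3; ids[NR] = $1 }
    END {
      for (i = 1; i <= NR; i++) {
        id = ids[i]; path = name[id]; p = parent[id]; steps = 0
        # Climb towards the root until the parent ID is unknown.
        # The step counter stops the loop if the file contains a cycle.
        while ((p in name) && p != id && steps++ < NR) {
          path = name[p] ";" path
          id = p; p = parent[id]
        }
        print ids[i] "\t" path
      }
    }'
}

# Usage (output file name is a placeholder):
#   rdp_lineages < mytaxon.txt > taxid_lineages.tsv
```

Turning this into a sequence-ID-to-lineage table would still need the taxon names attached to each reference sequence, which is what the later posts in this thread work from.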

Colin

JedL · February 4, 2020, 8:28pm

So working that way I have:

mytaxon.txt:

[screenshot: mytaxon]

and the 85 OTU example:

[screenshot: 85_OTU_example]

It looks like they’re in completely different formats. I would have no idea how to go about getting the former into a similar format to the latter. I don’t suppose there is any way of importing an RDP classifier into QIIME 2, is there?

Also, I should have mentioned this earlier but completely forgot: the mytaxon file and reference sequences include sequences from GenBank and BOLD. Will I be able to train a classifier with the references coming from two different sources?

colinbrislawn · February 5, 2020, 3:14pm

Yeah, it looks like they are in completely different formats...

As for importing an RDP classifier into QIIME 2: no...

Unless...
Take a look at this issue, where people ask the RDP devs about turning this format into a QIIME-compatible one.

Colin

JedL · February 9, 2020, 11:16pm

Thanks for turning me onto that! Will see if I can wrap my brain around that and work it out that way.

In the meantime though I found this. It looks like the OP might have been looking to do something kind of similar to what I wanted to do?

Following their protocol, I get the same error as before when trying to use the initial headers.txt file that I've made look like the example taxonomy file... Is there a reason this file can't be imported as a taxonomy? Have I completely misunderstood the OP's plan, and it isn't applicable at all?

Thanks,

Jed

colinbrislawn · February 10, 2020, 4:02pm

Good morning Jed,

I think your issue thread is much more helpful than the one I posted. The main goal is still to get a 7-level taxonomy file that matches the requirements for DADA2 and QIIME 2.

I think you are off to a good start! For others on the forums, here are the first few lines of that file:

>RXIO01000005	cellularOrganisms;Bacteria;undef_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Ancylobacter;Ancylobacter_aquaticus
>RXIO01000096	cellularOrganisms;Bacteria;undef_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Ancylobacter;Ancylobacter_aquaticus
>KJ592630	cellularOrganisms;Eukaryota;undef_Eukaryota;Rhodophyta;Florideophyceae;Hapalidiales;Hapalidiaceae;Mesophyllum;Mesophyllum_lichenoides
>KJ592635	cellularOrganisms;Eukaryota;undef_Eukaryota;Rhodophyta;Florideophyceae;Hapalidiales;Hapalidiaceae;Mesophyllum;Mesophyllum_lichenoides
>KJ592636	cellularOrganisms;Eukaryota;undef_Eukaryota;Rhodophyta;Florideophyceae;Hapalidiales;Hapalidiaceae;Mesophyllum;Mesophyllum_lichenoides
>KJ592638	cellularOrganisms;Eukaryota;undef_Eukaryota;Rhodophyta;Florideophyceae;Hapalidiales;Hapalidiaceae;Mesophyllum;Mesophyllum_lichenoides
>KJ592644	cellularOrganisms;Eukaryota;undef_Eukaryota;Rhodophyta;Florideophyceae;Hapalidiales;Hapalidiaceae;Mesophyllum;Mesophyllum_lichenoides

I think we are getting closer! It looks like each line of this file starts with > instead of going directly to the read ID. It definitely has more than 7 taxonomy levels... and might even have an inconsistent number of taxonomy levels.

I think with some sed and maybe cut, you can make these files match! Let me know what you try!
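
A minimal sketch of the sed step hinted at here, plus a quick check on whether every line has the same number of taxonomy levels. It assumes the file is tab-separated, with the ID in column 1 and the lineage in column 2, as in the lines shown. The function names and the file names in the usage comment are placeholders, not anything from the thread.

```shell
# Remove the leading ">" so each line starts with the bare read ID.
strip_marker() {
  sed 's/^>//'
}

# Print "levels<TAB>lines" for each distinct number of ";"-separated levels.
level_counts() {
  awk -F'\t' '{ seen[split($2, r, ";")]++ }
       END { for (n in seen) print n "\t" seen[n] }' | sort -n
}

# Usage (file names are placeholders):
#   strip_marker < headers.txt > taxonomy_noheader.tsv
#   level_counts < taxonomy_noheader.tsv
```

If level_counts prints more than one line, the file mixes lineages of different depths and a fixed "drop the first N levels" rule may not suit every entry.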

Colin

thermokarst · February 10, 2020, 4:53pm

@colinbrislawn, the link you posted is about the structure of the file format. The link @JedL provided demonstrates a potential script for parsing and reformatting that file. With the first link, a user would be armed with the general knowledge to implement the script at the second link.

JedL · February 10, 2020, 11:40pm

Thanks for the guidance, I was able to get it imported as a Taxonomy just now!

Hopefully it will be useful to me!

The final bits and bobs were just removing the first two layers of classification (e.g. cellularOrganisms;Bacteria) and the '>' before the ID, and editing a couple of the IDs so they weren't recognised as matching (I changed them in the ref seqs as well; I feel like I shouldn't have done that, but hey ho). I'll see if it works now...
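
The last steps described here could look roughly like the sketch below: drop the first two levels of each lineage, then list any IDs that occur more than once. The helper names and file names are invented for illustration, and the input is assumed to be tab-separated with the '>' already removed. The import command in the usage comment simply reuses the flags from the first post.

```shell
# Drop the first two ";"-separated levels (e.g. cellularOrganisms;Bacteria).
drop_first_two_levels() {
  awk -F'\t' -v OFS='\t' '{
    n = split($2, level, ";")
    taxon = ""
    for (i = 3; i <= n; i++) {
      sep = (i > 3) ? ";" : ""
      taxon = taxon sep level[i]
    }
    print $1, taxon
  }'
}

# List IDs that appear on more than one line.
duplicate_ids() {
  cut -f1 | sort | uniq -d
}

# Usage (file names are placeholders):
#   drop_first_two_levels < taxonomy_noheader.tsv > taxonomy_7level.tsv
#   duplicate_ids < taxonomy_7level.tsv
#   qiime tools import \
#     --type 'FeatureData[Taxonomy]' \
#     --input-format HeaderlessTSVTaxonomyFormat \
#     --input-path taxonomy_7level.tsv \
#     --output-path co1_ref_taxonomy_TP.qza
```

Any ID that duplicate_ids reports would need the same rename applied in the reference sequences, since the IDs in the two files have to agree.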