Importing table and tree with underscores in IDs leads to mismatch

jwdebelius · August 24, 2021, 7:01am

I've been using the GTDB tree/Kraken 2 table recently. The genome IDs contain underscores. When I import my table into QIIME 2 as a FeatureTable[Frequency] (so biom under the hood), the underscores in the IDs are preserved. When I import the tree (Phylogeny[Rooted]) with tip IDs that contain underscores, the underscores are replaced by spaces, so the tree tips no longer match the table's feature IDs. This may be a quirk specific to scikit-bio or the Python API, but it's darn obnoxious. If it's more appropriate as a scikit-bio issue, I'm happy to take it there, but QIIME 2 is where the integration breaks.

Let me know if example code, etc would help; I'm happy to share.

Best,
Justine

thermokarst · August 24, 2021, 2:35pm

Thanks @jwdebelius, this is a newick file format spec thing, and in particular scikit-bio's interpretation of the spec:

http://scikit-bio.org/docs/0.5.6/generated/skbio.io.format.newick.html#module-skbio.io.format.newick

There is a mechanism to escape underscores so that they aren't replaced with spaces: wrap the label in single quotes in the newick file. Maybe that'll help get you moving in the right direction? Keep us posted!

PS

I agree! We should think about some tooling that might help with this - perhaps an import format that applies escaping rules prior to import?
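
For anyone landing here, a minimal sketch of what "applying escaping rules prior to import" could look like. It assumes the newick rule described in the scikit-bio docs linked above (labels wrapped in single quotes keep their underscores) and uses only the standard library. The function name `quote_underscored_labels` is hypothetical and not part of QIIME 2 or scikit-bio.

```python
import re

# Matches, in priority order: an already-quoted label, a bracketed comment,
# or a run of characters that are not newick structural characters.
_TOKEN = re.compile(r"'(?:[^']|'')*'|\[[^\]]*\]|[^(),:;'\[\]\s]+")


def quote_underscored_labels(newick: str) -> str:
    """Wrap unquoted labels that contain underscores in single quotes.

    Hypothetical helper: quoted labels and comments are left untouched,
    and tokens without underscores (e.g. branch lengths) pass through.
    """
    def fix(match):
        token = match.group(0)
        if token[0] in "'[":   # already quoted, or a comment
            return token
        if "_" in token:       # unquoted label that would lose its underscores
            return "'" + token + "'"
        return token

    return _TOKEN.sub(fix, newick)


print(quote_underscored_labels("(GB_GCA_000001.1:0.1,'RS_GCF_000002.1':0.2)root;"))
# ('GB_GCA_000001.1':0.1,'RS_GCF_000002.1':0.2)root;
```

Running the tree file through something like this before importing should keep the tip IDs identical to the table's feature IDs, but treat it as a sketch and check a few tips after import.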

jwdebelius · August 25, 2021, 6:19am

Thanks @thermokarst!

I will pay more attention to the escape on import with scikit-bio/the python API. It would be nice for q2-cli to maintain underscores upon import; there's always the possibility that I'm an edge case, but I don't think so?
In the meantime, I just have to remember to rename all my feature IDs.
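
Roughly, the renaming workaround looks like the sketch below. Plain Python lists stand in for the table's feature IDs and the tree's tip names, and `build_id_map` is a hypothetical helper name, not a QIIME 2 or biom function. It assumes the only difference between the two sets of IDs is underscores turned into spaces.

```python
def build_id_map(table_ids, tree_tip_names):
    """Map each table feature ID to the matching tree tip name.

    Hypothetical helper: an ID matches either exactly or after replacing
    underscores with spaces (how an unquoted newick label is read).
    Returns the mapping plus the IDs that matched nothing.
    """
    tips = set(tree_tip_names)
    id_map = {}
    unmatched = []
    for feature_id in table_ids:
        spaced = feature_id.replace("_", " ")
        if feature_id in tips:
            id_map[feature_id] = feature_id
        elif spaced in tips:
            id_map[feature_id] = spaced
        else:
            unmatched.append(feature_id)
    return id_map, unmatched


# Made-up example IDs, for illustration only
table_ids = ["GB_GCA_000001.1", "RS_GCF_000002.1", "not_in_tree"]
tip_names = ["GB GCA 000001.1", "RS GCF 000002.1"]
id_map, unmatched = build_id_map(table_ids, tip_names)
```

The resulting dictionary can then be used to relabel the table's feature IDs, and anything left in `unmatched` is worth checking by hand before filtering.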

Best,
Justine

system · September 25, 2021, 12:19pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.