Importing Phylogenetic Tree from MEGAN

Oops, sorry, there was a typo in my command above. I've edited it (specifically, the line that created table_ids). Sorry about that!


print('total tree tips: %d' % len(tree_ids))
total tree tips: 2102697
print('total feature ids: %d' % len(table_ids))
total feature ids: 93

and

print(table_ids)
{'1919201', '1897049', '1843491', '1313', '1263', '1897048', '3803', '1472416', '2045012', '28026', '31979', '1681', '165186', '976', '1897032', '2759', '72025', '74426', '216851', '909932', '207244', '91061', '28890', '1680', '216816', '816', '1776382', '1301', '572511', '84999', '375288', '1924105', '186801', '841', '1897045', '3869', '1843489', '186806', '1853231', '3398', '85004', '291644', '29465', '1236', '1224', '574697', '158846', '1730', '909929', '2', '2157', '649756', '1897018', '84998', '815', '1897062', '1239', '1898205', '909656', '186826', '3870', '853', '72275', '1897035', '2235', '35493', '1485', '541000', '1300', '135622', '1678', '186803', '1262769', '1897043', '103892', '183963', '171549', '2005525', '186802', '387661', '201174', '310298', '33090', '437897', '31953', '84107', '1519438', '2742', '102106', '31977', '200643', '77133', '1760'}

I checked with a separate script whether all 93 of these table IDs exist in the ncbi.tre file, and they do.


Excellent!

Okay, that's good to know! But we should verify that they are in the parsed tree, rather than just in the source file. As we discussed above, the parser has some tricky rules, and it's possible that the parsed tree is somehow winding up with different IDs (which is what @jwdebelius was initially suggesting above).

So with that in mind let's investigate if all of your table's feature ids are present in the tree (this should be run in the same session as the commands above):

print(table_ids <= tree_ids)
print(table_ids - tree_ids)
print([x[0] for x in zip(tree_ids, range(25))])

The first command will test if all of the table ids are found in the tree. The second command will show us which table ids are missing from the tree (which is what the error message you reported is telling us is happening when computing faith pd). The final command will show us the first 25 tree IDs, just for visual comparison.
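For anyone following along without the real data, here is a minimal sketch of what those set operations do. The two small sets are made-up stand-ins (a few IDs borrowed from this thread), not the actual table or tree contents:

```python
# Toy stand-ins for the real ID sets; the values are illustrative only.
table_ids = {'2', '1224', '1313'}
tree_ids = {'1313', '484883', '1351713'}

# True only if every table ID is also a tree ID (subset test).
all_covered = table_ids <= tree_ids

# IDs that are in the table but absent from the tree (set difference).
missing = table_ids - tree_ids

print(all_covered)
print(sorted(missing))
```

If the subset test prints True, the set difference will be empty, and the table is fully covered by the tree.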


Hi, thanks for this. It confirmed many of the table_ids are missing from the tree.

print(table_ids <= tree_ids)
False

print(table_ids - tree_ids)
{'976', '909929', '1300', '1485', '186826', '1924105', '31979', '85004', '31953', '291644', '201174', '541000', '815', '84107', '437897', '2005525', '2235', '186802', '31977', '1678', '816', '1730', '2742', '387661', '310298', '909656', '1680', '375288', '183963', '574697', '1853231', '1760', '102106', '1313', '158846', '186801', '72275', '74426', '29465', '33090', '91061', '1681', '84999', '853', '1224', '72025', '2', '135622', '909932', '3398', '1263', '1843489', '84998', '216816', '1301', '35493', '200643', '216851', '3870', '171549', '3803', '186803', '1236', '1239', '186806', '572511', '2157', '28890', '1843491', '2759', '3869', '28026', '207244', '649756', '841'}

print([x[0] for x in zip(tree_ids, range(25))])
['1351713', '484883', '2296325', '2103500', '1162454', '2211437', '509799', '2088966', '1570467', '768846', '1457306', '2383518', '641779', '2212722', '888456', '383391', '2022826', '1782937', '1131486', '2753209', '102932', '1478069', '2322803', '2018458', '578438']


Okay! This has been pretty helpful! Before I jump in to some details:

  1. We are not able to provide support for MEGAN, since it is not a tool developed by our team.
  2. I am not familiar at all with MEGAN, so please take my advice with some healthy skepticism.

At the root of this (pun intended), this is a case of mismatched IDs, as we've discussed above. You have IDs present in your feature table that are not present in your tree. I have wanted to play with MEGAN a bit, though (for my own personal gain), and I observed that when you select all the nodes in the interactive viewer, the BIOM table will include a feature for every node, including internal support nodes in the tree. The problem is that a metric like faith_pd only operates on the tips/leaves, not on the support nodes. The way I "solved" this in my own little test dataset was by selecting "All Nodes" before exporting the tree, and "All Leaves" before exporting the table.

When I approached it this way, I had a table that consisted only of the tips/leaves, which meant that the table was "fully covered" by the tree.
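To illustrate the tips-versus-internal-nodes point, here is a rough sketch in plain Python. The `children` mapping and the `tip_ids` helper are hypothetical (the IDs are borrowed from this thread, but the parent/child relationships are made up for the example), and this is not how QIIME 2 or MEGAN represent trees internally:

```python
# Hypothetical miniature hierarchy: parent ID -> list of child IDs.
# The relationships are invented for illustration only.
children = {
    '2': ['1224', '1239'],
    '1224': ['1236'],
    '1239': ['91061'],
    '91061': ['1313'],
}

def tip_ids(children):
    """Return the IDs that appear as a child but never as a parent."""
    parents = set(children)
    all_children = {c for kids in children.values() for c in kids}
    return all_children - parents

tips = tip_ids(children)

# A table exported with every node selected would contain internal nodes too.
table_ids = {'2', '1239', '1313', '1236'}

internal_features = table_ids - tips   # features that are not tips
tip_features = table_ids & tips        # features a tip-based metric can use

print(sorted(internal_features))
print(sorted(tip_features))
```

Any ID that lands in `internal_features` is the kind of feature that a tip-based metric would report as missing from the tree.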

Again, we aren't able to provide specific support for MEGAN, but the goal here is to make it clear that when you're importing data into QIIME 2 that has been prepared elsewhere, you might have to do some extra work to ensure that everything is in order.

I hope that helps, please keep us posted!


Hi @thermokarst and @jwdebelius. Thank you a million times for your extensive and patient help. There is definitely an issue with mismatched IDs, and it's clear QIIME2 is not the issue. I will continue troubleshooting it myself. Incidentally, QIIME2 throws the same error whether all nodes or only the leaves (the species rank) are selected in MEGAN, so unfortunately it is not a super simple fix.


I had one more thought on this, FWIW. Certain metagenomic tools will create a table with all the levels combined. My Metaphlan table will return phylum, class, order, family, genus, and species combined. Most of the tools in QIIME assume that you're working with one taxonomic level (everything annotated to family, for example). So, perhaps you want to double check the taxonomic assignments of the features that are nodes in the tree and see if they're stopping at a higher taxonomic level.
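As a rough sketch of that check, something like the following would count how many features stop at each rank. It assumes lineage strings in a pipe-delimited `rank__name` style; the `lineages` list and the `deepest_rank` helper are hypothetical, so adapt the parsing to whatever your tool actually exports:

```python
from collections import Counter

# Hypothetical lineage strings; the "k__...|p__..." layout is an assumption.
lineages = [
    'k__Bacteria',
    'k__Bacteria|p__Firmicutes',
    'k__Bacteria|p__Firmicutes|c__Bacilli',
    'k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae',
    'k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria',
]

def deepest_rank(lineage):
    """Return the rank prefix (e.g. 'p', 'f') of the last level in a lineage."""
    last_level = lineage.rsplit('|', 1)[-1]
    return last_level.split('__', 1)[0]

# Tally features by the deepest rank they were assigned to.
rank_counts = Counter(deepest_rank(lineage) for lineage in lineages)

# More than one distinct rank means the table mixes taxonomic levels.
mixed_levels = len(rank_counts) > 1

print(dict(rank_counts))
print(mixed_levels)
```

If `mixed_levels` comes back True, the table combines several levels and probably needs to be collapsed or filtered to a single rank first.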

Best,
Justine


Hello everyone. I just wanted to highlight one important point: the tree in MEGAN is not a phylogenetic tree, but a visualization of taxonomic hierarchy. As you know, quite often taxonomy and phylogeny have little to do with each other.

@fjdisofj0ew, if you plan to perform analyses using phylogenetic trees, then I propose that you use your sequence data to construct a phylogeny (i.e. check out the phylogeny tutorial), assuming these are amplicon sequences. If not, you'll have to use another tool to construct a phylogeny. This is important because the branch lengths have no meaning in the MEGAN hierarchical taxonomy visualizer.

-Mike

