Importing Phylogenetic Tree from MEGAN

fjdisofj0ew · October 25, 2021, 5:46am

Hello,

I am trying to import phylogenetic information into QIIME2 for diversity analysis, but I am having some issues. The main error message I get is "The table does not appear to be completely represented by the phylogeny". I have a FeatureTable[Frequency] with taxonomic assignments from a .biom file I exported from MEGAN. It is in this format:

"rows":[{"metadata":{"taxonomy":["d__Bacteria"]},"id":"2"},{"metadata":{"taxonomy":["d__Bacteria","s__uncultured bacterium"]},"id":"77133"},{"metadata":{"taxonomy":["d__Bacteria","p__Bacteroidetes"]},"id":"976"},{"metadata":{"taxonomy":["d__Bacteria", ....
"columns":[{"metadata":{},"id":"ES9-1"},{"metadata":{},"id":"ES1-2"}],"matrix_type":"dense","matrix_element_type":"int","shape":[93,2],"data":[[0.0,43404.0],[12123.0,0.0],[0.0,0.0],[0.0,0.0],[352628.0,1969263.0],[0.0,0.0],[10786.0,5120.0],[23463.0,41556.0],[18384.0,0.0],[0.0,12236.0],[0.0,0.0],[0.0,0.0],[14066.0,0.0],[0.0,0.0],[73603.0,234769.0],[35838.0,11903.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[1243603.0,1729832.0],[149361.0,57626.0],[34720.0,0.0],[40593.0,29055.0],[53655.0,57874.0],[0.0,96118.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[5441.0,73132.0],[8567.0,0.0],[28942.0,207225.0],[0.0,0.0],[0.0,0.0], ....

I try to export a .tre file from MEGAN and use it with this frequency table, but the error I mentioned above appears. I also tried using the ncbi.tre file from MEGAN, but the same issue occurs. The files are below. I think the issue is incompatibility between how taxonomies are represented (ID versus written out), but I am not sure. I am not looking for an exact recipe for importing these phylogeny files, but I am trying to diagnose the issue. Does anyone know what the problem is?
Comparison.tre (2.7 KB)
ncbi.tre.gz (6.0 MB)

jwdebelius · October 25, 2021, 7:46am

Hi @fjdisofj0ew,

I think this is related to a... let's not call it a bug, but a particular feature of the QIIME 2 import pipeline that casts underscores in tree IDs to spaces. My solution has been to transform my IDs before import. (I do a find and replace to convert "_" to "-" in both my table and tree.)

It may not exactly be your error, but worth a try.
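
A minimal sketch of that find-and-replace, assuming the feature IDs are plain strings and the tree is a Newick string with unquoted labels (underscores have no structural meaning in Newick, so replacing them only touches labels). The helper names `dash_ids` and `dash_newick` are hypothetical, not part of QIIME 2:

```python
def dash_ids(ids):
    """Replace underscores with hyphens in each feature ID."""
    return [feature_id.replace("_", "-") for feature_id in ids]


def dash_newick(newick):
    """Replace underscores with hyphens throughout a Newick string.

    Assumption: labels are unquoted, so every underscore belongs to a label.
    """
    return newick.replace("_", "-")


# Toy values for illustration only.
print(dash_ids(["ASV_1", "ASV_2"]))
print(dash_newick("((ASV_1:0.1,ASV_2:0.2)n_1:0.3);"))
```

Applying the same substitution to both files keeps the table IDs and the tree labels in step with each other.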

Best,
Justine

fjdisofj0ew · October 25, 2021, 1:55pm

Hi @jwdebelius, thank you so much for this explanation! That makes a lot of sense. Thanks again for your help.

fjdisofj0ew · November 4, 2021, 5:25pm

Hi! Just to follow up, I am not convinced the underscores being cast to spaces is the issue. I am confused as to what the problem is, actually. Here is my workflow:

Download .biom file from MEGAN -> convert .biom file to FeatureTable[Frequency]

qiime tools import \
  --input-path Comparison-Taxonomy.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path Comparison-Taxonomy.qza

This feature table lists the features as their NCBI taxonomic IDs and not as their written-out names (e.g. 291644 instead of d__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides salyersiae). I know this because when I use feature-table summarize to convert the FeatureTable[Frequency] to a .qzv file, the features are listed as these numbers rather than the taxa names.

(I don't know if this is useful information, but to make bar plots, I created a file I named taxonomy.txt with each taxon ID associated with its written-out name and imported it as FeatureData[Taxonomy]. I then supply it in the --i-taxonomy option for qiime taxa bar plot so the bar plot has names instead of numeric IDs.)

Here is the strange part. I wrote a short script to take every numeric ID present in my .biom file and see if it is present in the ncbi.tre file. The ncbi.tre file, attached above, lists the phylogeny using only these numeric IDs rather than spelled-out names. Every single ID present in the .biom file is also present in the ncbi.tre file. So this is where I am stumped as to why there is an issue.
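
For reference, a check of this kind could look roughly like the sketch below. It is not the original script. It assumes a BIOM 1.0 (JSON) file and a Newick tree with unquoted labels and no bracketed comments; `biom_row_ids` and `newick_labels` are hypothetical helper names, and the inline data are toy values. Note that it collects every label in the Newick text and does not distinguish tips from internal nodes.

```python
import json
import re


def biom_row_ids(biom_text):
    """Return the row (observation) IDs from BIOM 1.0 JSON text."""
    return {str(row["id"]) for row in json.loads(biom_text)["rows"]}


def newick_labels(newick_text):
    """Return every unquoted label in a Newick string (tips and internal nodes)."""
    labels = set()
    # Labels are the text between structural symbols; drop branch lengths.
    for chunk in re.split(r"[(),;]", newick_text):
        label = chunk.split(":")[0].strip()
        if label:
            labels.add(label)
    return labels


# Toy inputs for illustration only; real use would read the two files instead.
biom_text = '{"rows": [{"id": "2", "metadata": null}, {"id": "976", "metadata": null}, {"id": "999", "metadata": null}]}'
tree_text = "((1985431)976,(77133)2)1;"

missing = biom_row_ids(biom_text) - newick_labels(tree_text)
print(sorted(missing))
```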

Here is how I am importing the ncbi.tre file, which is in Newick format and is a rooted tree:

qiime tools import \
  --input-path ncbi.tre \
  --output-path rooted-tree.qza \
  --type 'Phylogeny[Rooted]'

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table Comparison-Taxonomy.qza \
  --p-sampling-depth 6655889 \
  --m-metadata-file Comparison-metadata.txt \
  --output-dir core-metrics-results

Any insight or ideas would be so so appreciated.

jwdebelius · November 10, 2021, 1:36pm

Hi @fjdisofj0ew,

If you look at the first few lines of the tree file, what IDs are present there? You could run something like

head ncbi.tre

to check.

The other option is that the tree covers some, but not all, of the tips. (I've also had this issue recently.) I might troubleshoot that by checking how filtering changes the feature table.

Best,
Justine

fjdisofj0ew · November 10, 2021, 5:53pm

The file is quite large, but the beginning starts like this (head ncbi.tre):

((((((1985431)1985430,(1985433)1985432,(1660251)1660250,((2790964)2790963)2790962)458031,((((1335627,1615681,946336,2218256,1335629,542300,(1661951)1661950,(1872106,1078845)2669388)1078830,(1560006,768535,(1401345)1401344,(1872113,2303751,2305226)2638815)1000999,(2211140,2608730)2608728,(1382359,(240015)33075,(1380837,114706,115380,115381,115382)114705,(1872119,2763107,433983,1336583,192103,191847,191869,2505759,1082710,1671493,747482,1454928,1641853,1641854,939176,939177,939178,939179,939180,939181,939183,1391499,146085)2625945)33973,(863522,(1401343)1401342,(1641857)2627905)1078860,((204669)658062,(705306)705305,(2510641)2677081)658061,(2042962,(2771375,2043160,2771378,2771377,2771376)2635724)2136116,(2043161)2136117,(2040573,(1121860)570835,2259016,1560005,1933044,2051959,388466,(492480)492478,(1934404,2703788,2763071,577536,1933043,1378307,1896652,1641863,1703116,1703129,1933045)2637509)388463,(871561,655996,655991,194844,134726,134727,78975,78941,78958,90880,90882,90881,90879)112075,((1241325)1241324,1335628,474949,940613,741063,(682795)940614,474951,474950,474952,940616,2479048,(1198114)940615,(1969471,1617967,2485170,1270661,1270662,1270663,1277346,1641865,1747222,1641866,1322326,1747224,1747226,1660100,1911516,1946580,1946581,2602070)2621151)940557,(1002689,1002691)1742983,((1775470)1775469,1577686)1775468,((1401347)1401346,474953,(2171561,1338507,1338508,2171563)2634852)1078829,(1577687,(1964191)2634994)1768185,((626101)626102,1592106,940139,(926566)392734,(401053)870903,1111115,(1889013,1238175,1713183,1713184,1190247,1641873,1527260,1384625,278961,278962,1534220)2628988)392733,(639029,639031,639032,2052142,2651330,2651329,1002692,1078886,1671529,367933,363880,363879,363878,1078887,639030,1267533,1267534,298608,1387376,1387377,1387378,1387379,1229651,1671530,713466,1402914,1402920,298609,278963,424383,1951344,1951345,1380348,542301,542302,542303,542304,542305,542306,542307,542308, ...

jwdebelius · November 11, 2021, 9:31am

Hi @fjdisofj0ew,

So, looking at your tree file, it looks like it has numeric IDs. ...So it's probably not a direct mismatch between the IDs.

Have you tried filtering the table to see how much you keep? (Try the q2-phylogeny plugin for ways to filter with trees).

Best,
Justine

fjdisofj0ew · November 18, 2021, 7:02pm

Hi @jwdebelius

After filtering with

qiime phylogeny filter-table \
  --i-table Comparison-Taxonomy.qza \
  --i-tree rooted-tree.qza \
  --o-filtered-table filtered-Comparison-Taxonomy.qza

I realized that QIIME2 is only recognizing the numeric IDs in the ncbi.tre file that are surrounded with commas (e.g. "60614,60615,60616,60600,60621"). If there are parentheses adjacent, it fails to see them (e.g. "(1852370,(1926673)2625192)"). How should I modify the .tre file to preserve the phylogenetic relationships but allow QIIME2 to recognize the IDs?
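
In Newick, a label that directly follows a closing parenthesis names the internal node for the clade just closed, while a label that follows "(" or "," is a tip. The sketch below separates the two kinds of label so that table IDs can be checked against each set. It assumes unquoted labels and no bracketed comments; `split_labels` is a hypothetical helper, not a QIIME 2 or scikit-bio function.

```python
import re

STRUCTURAL = {"(", ")", ",", ";"}


def split_labels(newick):
    """Return (tip_labels, internal_labels) for an unquoted Newick string."""
    tips, internal = set(), set()
    prev = None  # most recent structural symbol, or "label" after a label
    for tok in re.findall(r"[(),;]|[^(),;]+", newick):
        if tok in STRUCTURAL:
            prev = tok
            continue
        label = tok.split(":")[0].strip()  # drop any branch length
        if not label:
            continue
        # A label right after ')' names the clade that was just closed.
        (internal if prev == ")" else tips).add(label)
        prev = "label"
    return tips, internal


# The fragment quoted above, with a terminating semicolon added.
tips, internal = split_labels("(1852370,(1926673)2625192);")
print(sorted(tips), sorted(internal))
```

On that fragment, 1852370 and 1926673 come out as tips and 2625192 as an internal node label, which matches the pattern described above.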

thermokarst · November 18, 2021, 7:13pm

Hi @fjdisofj0ew, I don't mean to step on @jwdebelius's toes, but thought this might be useful for you:

http://scikit-bio.org/docs/0.5.6/generated/skbio.io.format.newick.html#explanation

QIIME 2 uses scikit-bio for working with trees, so all of the rules outlined in the link above are going to apply here. I'll draw your attention to this passage:

More characters can be used to create more descriptive labels. When creating a label there are some rules that must be considered due to limitations in the Newick format. The following characters are not allowed within a standard label: parenthesis, commas, square-brackets, colon, semi-colon, and whitespace.

What if these characters are needed within a label? In the simple case of spaces, an underscore (_) will be translated as a space on read and vice versa on write.
What if a literal underscore or any of the others mentioned are needed? A label can be escaped (meaning that its contents are understood as regular text) using single-quotes ('). When a label is surrounded by single-quotes, any character is permissible. If a single-quote is needed inside of an escaped label or anywhere else, it can be escaped with another single-quote. For example, A_1 is written 'A_1' and 'A'_1 would be '''A''_1'.

Oof, no fun.

fjdisofj0ew · November 18, 2021, 7:58pm

Hi @thermokarst, thank you for this information! Looking at the Newick format explanation, I see it's true that parentheses are not allowed in labels.

The following characters are not allowed within a standard label: parenthesis, commas, square-brackets, colon, semi-colon, and whitespace

But aren't parentheses used to show phylogenetic relationships?

To provide these relationships, there is another structure: paired parenthesis (( ) ). These are inserted at the location of an existing node and give it the ability to have children. Placing ( ) in a node’s location will create a child inside the parenthesis on the left-most inner edge.

So the ncbi.tre file seems to be in proper Newick format.

thermokarst · November 22, 2021, 2:56pm

Yes indeed, but keep in mind that a node label is not the same thing as the tree itself; it's just arbitrary metadata tacked onto the tree.

fjdisofj0ew · December 2, 2021, 9:20pm

I'm not sure I understand. The numeric IDs are the tips/node labels, and the parentheses are the internal nodes. Are you saying the number chosen to label the leaf is "arbitrary metadata"? In that case, I am still not sure what is incorrect about the ncbi tree.

thermokarst · December 2, 2021, 9:33pm

I probably misunderstood your earlier comment:

My understanding was that this meant that some of the tips had parentheses as part of the actual tip label.

Can you please share the complete error message you're observing (or point me to it above, in case I missed it)? Rerun with --verbose or upload the error log. Thanks!

fjdisofj0ew · December 3, 2021, 3:33am

Thanks for your help.

If I run something like:

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table Comparison-Taxonomy.qza \
  --p-sampling-depth 398283 \
  --m-metadata-file Comparison-metadata.txt \
  --output-dir core-metrics-results \
  --verbose

I get:
.../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/sklearn/metrics/pairwise.py:1776: DataConversionWarning: Data was converted to boolean for metric jaccard
  warnings.warn(msg, DataConversionWarning)
Traceback (most recent call last):
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "", line 2, in core_metrics_phylogenetic
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/qiime2/sdk/action.py", line 485, in _callable_executor_
    outputs = self._callable(scope.ctx, **view_args)
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/q2_diversity/_core_metrics.py", line 61, in core_metrics_phylogenetic
    faith_pd_vector, = faith_pd(table=cr.rarefied_table,
  File "", line 2, in faith_pd
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in _callable_executor_
    output_views = self._callable(**view_args)
  File "", line 2, in faith_pd
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/q2_diversity_lib/_util.py", line 57, in _disallow_empty_tables
    return wrapped_function(*args, **kwargs)
  File ".../.conda/envs/qiime2-2021.8/lib/python3.8/site-packages/q2_diversity_lib/alpha.py", line 49, in faith_pd
    result = unifrac.faith_pd(table_str, tree_str)
  File "unifrac/_api.pyx", line 162, in unifrac._api.faith_pd
ValueError: The table does not appear to be completely represented by the phylogeny.

Plugin error from diversity:

  The table does not appear to be completely represented by the phylogeny.

See above for debug info.

thermokarst · December 6, 2021, 2:58pm

Great, thanks!

Okay, next up we're going to take a closer look at your tree's tip labels and the feature IDs in your table. The following is a Python snippet; you can run it in your QIIME 2 conda environment by typing the word python into your terminal, which will load an interactive shell. Copy, paste, and press the enter/return key to run the commands. When you're done, enter the command exit() to leave the shell.

import pandas as pd
import qiime2
import skbio

# Load the tree artifact and collect the names of its tips
tree_fp = 'rooted-tree.qza'
tree_art = qiime2.Artifact.load(tree_fp)
tree_obj = tree_art.view(skbio.TreeNode)
tree_ids = set(tip.name for tip in tree_obj.tips())

# Load the feature table; in the DataFrame view, rows are samples and columns are features
table_fp = 'Comparison-Taxonomy.qza'
table_art = qiime2.Artifact.load(table_fp)
table_obj = table_art.view(pd.DataFrame)
table_ids = set(table_obj.columns)

print('total tree tips: %d' % len(tree_ids))
print('total feature ids: %d' % len(table_ids))

print(tree_ids)
print(table_ids)

This will let us start to understand what's going on here with some concrete data, which I hope will be helpful. Let me know what you find!
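
Two large printed sets are hard to compare by eye, so a set difference may be a useful follow-up. The sketch below assumes `table_ids` and `tree_ids` are sets of strings like the ones built in the snippet above; `summarize_overlap` is a hypothetical helper, and the values passed to it here are toy stand-ins.

```python
def summarize_overlap(table_ids, tree_ids, preview=10):
    """Count how many table IDs are tree tips and list a few that are not."""
    table_ids, tree_ids = set(table_ids), set(tree_ids)
    missing = table_ids - tree_ids
    return {
        "n_table": len(table_ids),
        "n_shared": len(table_ids & tree_ids),
        "n_missing": len(missing),
        "missing_preview": sorted(missing)[:preview],
    }


# Toy stand-ins; in the interactive session, pass the real table_ids and tree_ids.
print(summarize_overlap({"2", "976", "77133"}, {"976", "1985431"}))
```

If `n_missing` is greater than zero, the IDs in `missing_preview` are the ones to look up in the tree file.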


fjdisofj0ew · December 6, 2021, 6:46pm

print('total tree tips: %d' % len(tree_ids))
total tree tips: 2102697
print('total feature ids: %d' % len(table_ids))
total feature ids: 2
print(tree_ids)

Then it prints the long list of tree IDs.

print(table_ids)
{'sample1', 'sample2'}

There should be 2302807 tips, so about 8.69% are missing. I am not sure whether something is wrong with the ncbi tree from MEGAN or with the import into QIIME2.
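
The percentage can be reproduced from the two counts quoted in this post. The variable names are only illustrative:

```python
expected_tips = 2302807  # expected tip count quoted above
observed_tips = 2102697  # "total tree tips" reported by the snippet

missing_tips = expected_tips - observed_tips
missing_pct = 100 * missing_tips / expected_tips

print(f"{missing_tips} tips missing ({missing_pct:.2f}%)")
```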