Importing Phylogenetic Tree from MEGAN

Hello,

I am trying to import phylogenetic information into QIIME2 for diversity analysis, but I am having some issues. The main error message I get is "The table does not appear to be completely represented by the phylogeny". I have a FeatureTable[Frequency] with taxonomic assignments from a .biom file I exported from MEGAN. It is in this format:

"rows":[{"metadata":{"taxonomy":["d__Bacteria"]},"id":"2"},{"metadata":{"taxonomy":["d__Bacteria","s__uncultured bacterium"]},"id":"77133"},{"metadata":{"taxonomy":["d__Bacteria","p__Bacteroidetes"]},"id":"976"},{"metadata":{"taxonomy":["d__Bacteria", ....
"columns":[{"metadata":{},"id":"ES9-1"},{"metadata":{},"id":"ES1-2"}],"matrix_type":"dense","matrix_element_type":"int","shape":[93,2],"data":[[0.0,43404.0],[12123.0,0.0],[0.0,0.0],[0.0,0.0],[352628.0,1969263.0],[0.0,0.0],[10786.0,5120.0],[23463.0,41556.0],[18384.0,0.0],[0.0,12236.0],[0.0,0.0],[0.0,0.0],[14066.0,0.0],[0.0,0.0],[73603.0,234769.0],[35838.0,11903.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[1243603.0,1729832.0],[149361.0,57626.0],[34720.0,0.0],[40593.0,29055.0],[53655.0,57874.0],[0.0,96118.0],[0.0,0.0],[0.0,0.0],[0.0,0.0],[5441.0,73132.0],[8567.0,0.0],[28942.0,207225.0],[0.0,0.0],[0.0,0.0], ....

I tried exporting a .tre file from MEGAN and using it with this frequency table, but the error I mentioned above appears. I also tried using the ncbi.tre file from MEGAN, but the same issue occurs. The files are below. I think the issue is an incompatibility between how taxonomies are represented (numeric ID versus written-out name), but I am not sure. I am not looking for an exact recipe for importing these phylogeny files; I am trying to diagnose the issue. Does anyone know what the problem is?
Comparison.tre (2.7 KB)
ncbi.tre.gz (6.0 MB)

Hi @fjdisofj0ew,

I think this is related to a... let's not call it a bug, but a particular feature of the QIIME 2 import pipeline that casts underscores in tree IDs to spaces. My solution has been to transform my IDs before import. (I do a find and replace and convert "_" to "-" in both my table and tree.)

It may not exactly be your error, but worth a try.
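
The find-and-replace described above could be sketched in Python roughly as follows. The IDs and the tiny tree here are made up for illustration, and the replacement is deliberately naive: it assumes underscores only ever occur inside unquoted labels.

```python
def hyphenate(text: str) -> str:
    """Replace every underscore with a hyphen.

    Naive on purpose: it rewrites the whole string, so it is only safe
    when underscores appear solely inside unquoted IDs or labels.
    """
    return text.replace("_", "-")


# Hypothetical feature IDs and a hypothetical Newick tree using the same IDs.
table_ids = ["OTU_1", "OTU_2", "OTU_3"]
newick = "((OTU_1:0.1,OTU_2:0.2):0.05,OTU_3:0.3);"

new_ids = [hyphenate(feature_id) for feature_id in table_ids]
new_newick = hyphenate(newick)

print(new_ids)
print(new_newick)
```

The point is only that the same transformation has to be applied to both the table and the tree, so the two sets of IDs still match after import.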

Best,
Justine


Hi @jwdebelius, thank you so much for this explanation! That makes a lot of sense. Thanks again for your help.


Hi! Just to follow up, I am not convinced the underscores being cast to spaces is the issue. I am confused as to what the problem is, actually. Here is my workflow:

Download .biom file from MEGAN -> convert .biom file to FeatureTable[Frequency]

qiime tools import \
  --input-path Comparison-Taxonomy.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path Comparison-Taxonomy.qza

This feature table lists the features as their NCBI taxonomic IDs and not as their written-out names (e.g. 291644 instead of d__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides salyersiae). I know this because when I use feature-table summarize to convert the FeatureTable[Frequency] to a .qzv file, the features are listed as these numbers rather than the taxa names.

(I don't know if this is useful information, but to make bar plots, I created a file I named taxonomy.txt with each taxon ID associated with its written-out name and imported it as FeatureData[Taxonomy]. I then supply it in the --i-taxonomy option for qiime taxa bar plot so the bar plot has names instead of numeric IDs.)
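
For anyone following along, building such a taxonomy.txt from the .biom rows might look roughly like the sketch below. It assumes the JSON layout shown at the top of this thread, and it assumes a tab-separated file with a "Feature ID" / "Taxon" header and "; " between ranks; the poster's actual file may differ.

```python
import json

# Two rows copied from the excerpt at the top of the thread.
biom_text = '''
{"rows": [
  {"metadata": {"taxonomy": ["d__Bacteria"]}, "id": "2"},
  {"metadata": {"taxonomy": ["d__Bacteria", "p__Bacteroidetes"]}, "id": "976"}
]}
'''


def taxonomy_lines(biom: dict) -> list[str]:
    """Return one TSV line per feature: numeric ID, then the joined lineage."""
    lines = ["Feature ID\tTaxon"]  # assumed header, not confirmed by the thread
    for row in biom["rows"]:
        taxon = "; ".join(row["metadata"]["taxonomy"])
        lines.append(f"{row['id']}\t{taxon}")
    return lines


lines = taxonomy_lines(json.loads(biom_text))
print("\n".join(lines))
```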

Here is the strange part. I wrote a short script to take every numeric ID present in my .biom file and check whether it is present in the ncbi.tre file. The ncbi.tre file, attached above, lists the phylogeny using only these numeric IDs rather than spelled-out names. Every single ID present in the .biom file is also present in the ncbi.tre file. So I am stumped as to why there is an issue.
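
A check along these lines might look like the following sketch. It is not the poster's script: the function names and the toy tree are made up, and it assumes every label in the tree is a bare integer with no branch lengths, as in the ncbi.tre excerpt later in the thread.

```python
import json
import re


def biom_ids(biom_text: str) -> set[str]:
    """Collect the row (feature) IDs from a BIOM 1.0 JSON document."""
    return {row["id"] for row in json.loads(biom_text)["rows"]}


def newick_numeric_labels(newick: str) -> set[str]:
    """Collect every run of digits in the Newick text.

    Assumes integer-only labels and no branch lengths or support values.
    """
    return set(re.findall(r"\d+", newick))


# IDs taken from the thread; the tree itself is a made-up toy example.
biom_text = '{"rows": [{"id": "2"}, {"id": "976"}, {"id": "77133"}]}'
toy_tree = "((976,1239)2,77133)1;"

tree_labels = newick_numeric_labels(toy_tree)
missing = biom_ids(biom_text) - tree_labels
print("IDs missing from the tree file:", sorted(missing))
```

Note that a check like this only tells you that an ID appears somewhere in the file. It says nothing about where in the tree the label sits.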

Here is how I am importing the ncbi.tre file, which is a rooted tree in Newick format:

qiime tools import \
  --input-path ncbi.tre \
  --output-path rooted-tree.qza \
  --type 'Phylogeny[Rooted]'

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table Comparison-Taxonomy.qza \
  --p-sampling-depth 6655889 \
  --m-metadata-file Comparison-metadata.txt \
  --output-dir core-metrics-results

Any insight or ideas would be so so appreciated.


Hi @fjdisofj0ew,

If you look at the first few lines of the tree file, what IDs are present there? You could run something like

head ncbi.tre

to check.

The other option is that the tree covers some, but not all, of the tips. (I've also had this issue recently.) I might troubleshoot that by checking how filtering changes the feature table.

Best,
Justine

The file is quite large, but the beginning starts like this (head ncbi.tre):

((((((1985431)1985430,(1985433)1985432,(1660251)1660250,((2790964)2790963)2790962)458031,((((1335627,1615681,946336,2218256,1335629,542300,(1661951)1661950,(1872106,1078845)2669388)1078830,(1560006,768535,(1401345)1401344,(1872113,2303751,2305226)2638815)1000999,(2211140,2608730)2608728,(1382359,(240015)33075,(1380837,114706,115380,115381,115382)114705,(1872119,2763107,433983,1336583,192103,191847,191869,2505759,1082710,1671493,747482,1454928,1641853,1641854,939176,939177,939178,939179,939180,939181,939183,1391499,146085)2625945)33973,(863522,(1401343)1401342,(1641857)2627905)1078860,((204669)658062,(705306)705305,(2510641)2677081)658061,(2042962,(2771375,2043160,2771378,2771377,2771376)2635724)2136116,(2043161)2136117,(2040573,(1121860)570835,2259016,1560005,1933044,2051959,388466,(492480)492478,(1934404,2703788,2763071,577536,1933043,1378307,1896652,1641863,1703116,1703129,1933045)2637509)388463,(871561,655996,655991,194844,134726,134727,78975,78941,78958,90880,90882,90881,90879)112075,((1241325)1241324,1335628,474949,940613,741063,(682795)940614,474951,474950,474952,940616,2479048,(1198114)940615,(1969471,1617967,2485170,1270661,1270662,1270663,1277346,1641865,1747222,1641866,1322326,1747224,1747226,1660100,1911516,1946580,1946581,2602070)2621151)940557,(1002689,1002691)1742983,((1775470)1775469,1577686)1775468,((1401347)1401346,474953,(2171561,1338507,1338508,2171563)2634852)1078829,(1577687,(1964191)2634994)1768185,((626101)626102,1592106,940139,(926566)392734,(401053)870903,1111115,(1889013,1238175,1713183,1713184,1190247,1641873,1527260,1384625,278961,278962,1534220)2628988)392733,(639029,639031,639032,2052142,2651330,2651329,1002692,1078886,1671529,367933,363880,363879,363878,1078887,639030,1267533,1267534,298608,1387376,1387377,1387378,1387379,1229651,1671530,713466,1402914,1402920,298609,278963,424383,1951344,1951345,1380348,542301,542302,542303,542304,542305,542306,542307,542308, ...

Hi @fjdisofj0ew,

So, looking at your tree file, it looks like it has numeric IDs, so it's probably not a direct mismatch between the IDs.

Have you tried filtering the table to see how much you keep? (Try the q2-phylogeny plugin for ways to filter with trees).

Best,
Justine

Hi @jwdebelius

After filtering with

qiime phylogeny filter-table \
  --i-table Comparison-Taxonomy.qza \
  --i-tree rooted-tree.qza \
  --o-filtered-table filtered-Comparison-Taxonomy.qza

I realized that QIIME2 is only recognizing the numeric IDs in the ncbi.tre file that are surrounded with commas (e.g. "60614,60615,60616,60600,60621"). If there are parentheses adjacent, it fails to see them (e.g. "(1852370,(1926673)2625192)"). How should I modify the .tre file to preserve the phylogenetic relationships but allow QIIME2 to recognize the IDs?
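
One possible reading of this observation: in Newick, a label written directly after a closing parenthesis names the internal node formed by that group, while a label after an opening parenthesis or a comma names a leaf (tip). The sketch below separates the two kinds of label with a regular expression. The function name is made up, quoted labels are not handled, and the "table" IDs are hypothetical.

```python
import re


def split_labels(newick: str) -> tuple[set[str], set[str]]:
    """Split unquoted Newick labels into (tip labels, internal-node labels)."""
    tips, internal = set(), set()
    # Each match is the character just before a label, then the label itself.
    # Branch lengths follow ':' and are therefore never matched.
    for prev, label in re.findall(r"([(),])([^(),:;\s]+)", newick):
        if prev == ")":
            internal.add(label)  # label closes a group: internal node
        else:
            tips.add(label)      # label follows '(' or ',': leaf
    return tips, internal


# Fragment quoted in the post above.
snippet = "(1852370,(1926673)2625192)"
tips, internal = split_labels(snippet)
print("tips:", sorted(tips))
print("internal:", sorted(internal))

# Hypothetical table IDs: which of them label internal nodes instead of tips?
table_ids = {"1852370", "2625192"}
print("table IDs on internal nodes:", sorted(table_ids & internal))
```

If many of the table's IDs turn out to sit on internal nodes (for example, IDs for higher taxonomic ranks), that may explain why filtering against the tree drops them.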

Hi @fjdisofj0ew, I don't mean to step on @jwdebelius's toes, but thought this might be useful for you:

http://scikit-bio.org/docs/0.5.6/generated/skbio.io.format.newick.html#explanation

QIIME 2 uses scikit-bio for working with trees, so all of the rules outlined in the link above are going to apply here. I'll draw your attention to this passage:

More characters can be used to create more descriptive labels. When creating a label there are some rules that must be considered due to limitations in the Newick format. The following characters are not allowed within a standard label: parenthesis, commas, square-brackets, colon, semi-colon, and whitespace.

What if these characters are needed within a label? In the simple case of spaces, an underscore (_) will be translated as a space on read and vice versa on write.
What if a literal underscore or any of the others mentioned are needed? A label can be escaped (meaning that its contents are understood as regular text) using single-quotes ('). When a label is surrounded by single-quotes, any character is permissible. If a single-quote is needed inside of an escaped label or anywhere else, it can be escaped with another single-quote. For example, A_1 is written 'A_1' and 'A'_1 would be '''A''_1'.
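
As a rough illustration of the quoting rules in that passage, here is a small sketch. The helper is hypothetical and is not part of scikit-bio; it simply wraps a label in single quotes whenever it contains one of the characters listed above, an underscore, or a single quote.

```python
# Characters that force quoting, per the passage quoted above,
# plus underscore and single quote.
SPECIAL = set("()[],:;_'")


def quote_label(label: str) -> str:
    """Return the label as it would need to be written in a Newick file."""
    needs_quotes = any(ch in SPECIAL or ch.isspace() for ch in label)
    if not needs_quotes:
        return label
    # Inside a quoted label, a literal single quote is doubled.
    return "'" + label.replace("'", "''") + "'"


print(quote_label("A_1"))    # 'A_1'
print(quote_label("'A'_1"))  # '''A''_1'
print(quote_label("976"))    # 976
```

The first two calls reproduce the examples from the documentation excerpt. Plain numeric IDs like the ones in ncbi.tre need no quoting at all.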

Oof, no fun.

Hi @thermokarst, thank you for this information! Looking at the Newick format explanation, I see it's true that parentheses are not allowed in labels.

The following characters are not allowed within a standard label: parenthesis, commas, square-brackets, colon, semi-colon, and whitespace

But aren't parentheses used to show phylogenetic relationships?

To provide these relationships, there is another structure: paired parenthesis (( ) ). These are inserted at the location of an existing node and give it the ability to have children. Placing ( ) in a node’s location will create a child inside the parenthesis on the left-most inner edge.

So the ncbi.tre file seems to be in proper Newick format.

Yes indeed, but keep in mind that a node label is not the same thing as the tree itself; it's just arbitrary metadata tacked onto the tree.
