Gemelli phylo-RPCA-with-taxonomy overwriting ASV taxonomy in an unexpected way

Hi, I'm using phylo-RPCA-with-taxonomy in Gemelli (@cmartino) and have encountered some off behavior in the phylo-taxonomy output. After running the function, a number of tree tips (ASVs) have been reassigned to different phyla, and (strangely) all Proteobacteria have been assigned to a single genus (Rhodanobacter)! (see differences in taxonomy vs phylo-taxonomy visualizations and rooted-tree vs. phylo-rooted tree Empress plots.)

The data I'm working with are ASVs output from DADA2, with taxonomies assigned with a naive Bayesian classifier trained on Silva138 V3-V4.

I'm presuming this reclassification is a result of the tax2Tree decoration step that Gemelli uses, and I'd expect some minor changes to taxonomy occur during this, but jumping phyla or reclassifying a whole phylum as a single genus is far beyond what I expected. Is this occurrence to be expected, or have I triggered something unusual?
Thanks in advance,
-Nathan Stewart

CODE:
qiime gemelli phylogenetic-rpca-with-taxonomy --i-table ts_feature-table.qza --i-phylogeny rooted-tree.qza --m-taxonomy-file taxonomy.qza --p-min-feature-count 10 --p-min-sample-count 500 --o-biplot phylo-ordination.qza --o-distance-matrix phylo-distance.qza --o-counts-by-node-tree phylo-tree.qza --o-counts-by-node phylo-table.qza --o-t2t-taxonomy phylo-taxonomy.qza

phylo-rooted-tree.qzv (623.8 KB)
rooted-tree.qzv (614.8 KB)
phylo-taxonomy.qzv (1.3 MB)
taxonomy.qzv (1.3 MB)
rooted-tree.qza (43.0 KB)
taxonomy.qza (122.1 KB)
ts_feature-table.qza (125.2 KB)

Screenshot of initial Empress tree:

Screenshot of Empress tree post phylo-RPCA:

2 Likes

Hi @NathanStewart,

Thanks for using phylo-RPCA and for reporting this issue. I agree, this is concerning. I ran through your example and was able to reproduce the results. After digging into the individual steps it seems the issue is in tax2tree which phylo-RPCA is wrapping to get the LCAs. I am going to bring @wasade here for help because I am not as familiar with the inner workings of tax2tree. @wasade is this some unexpected behavior by t2t?

Thanks,

Cameron

1 Like

Hi @NathanStewart,

How was the phylogeny created?

Best,
Daniel

1 Like

Hi @wasade,
I made it using the align-to-tree-mafft-fasttree command:

qiime phylogeny align-to-tree-mafft-fasttree --i-sequences rep-seqs.qza --o-alignment aligned-rep-seqs.qza --o-masked-alignment masked-aligned-rep-seqs.qza --o-tree unrooted-tree.qza --o-rooted-tree rooted-tree.qza

I can provide a subset of sequences in the rep-seqs.qza if needed.
Thanks for the help.

1 Like

Thanks, @NathanStewart.

@cmartino, Kalen Cantrell, and I are discussing this out of band. If I do a decoration directly with tax2tree, using effectively the same input data, I get results that on the surface seem reasonably consistent with the input taxonomy. See below for which ASVs are Rhodanobacter in the input (df) and the tax2tree observed (obs). We are working on figuring out the source of the discrepancy.

Best,
Daniel

In [15]: df[df['Taxon'].str.contains('Rhodanobacter')]
Out[15]: 
                                                        Taxon  Confidence
Feature ID                                                               
ASV3        d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.729268
ASV24       d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.999961
ASV219      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.837454
ASV236      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.999920
ASV284      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.999972
ASV309      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.999927
ASV314      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.714771
ASV363      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    1.000000
ASV426      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.889574
ASV453      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.998887
ASV539      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.965687
ASV580      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.999976
ASV613      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.999983
ASV644      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.750719
ASV675      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.992487
ASV756      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.855913
ASV848      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    0.884113
ASV879      d__Bacteria; p__Proteobacteria; c__Gammaproteo...    1.000000

In [16]: obs[obs['Taxon'].str.contains('Rhodanobacter')]
Out[16]: 
                                                        Taxon
Feature ID                                                   
ASV3        d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV24       d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV219      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV236      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV284      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV309      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV314      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV333      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV363      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV426      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV453      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV539      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV580      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV613      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV644      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV675      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV756      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV848      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
ASV879      d__Bacteria; p__Proteobacteria; c__Gammaproteo...
2 Likes

@wasade If possible, would you mind sharing how ran this? I am encountering some difficulties running the t2t decoration directly myself.
Thanks,
-Nathan

Hey @NathanStewart,

Sorry for the delay, I've been offline a few days.

It's not straight forward. The QIIME 2 taxonomy output is not uniform, where some lineages are specified to s__ but not all are. tax2tree requires lineages are fully specified so there is some manual normalization that is needed. I think that was the main crux.

I would prefer a solution within Gemelli, rather than suggesting a hack. Let's wait to hear back from the Gemelli developers, and if the fix is convoluted or likely to take a while to implement, then I'll propose a manual workaround. Sound reasonable?

Best,
Daniel

1 Like

@wasade
No worries.
Sure thing, having a solution within Gemelli would be preferable. Can I run my t2t attempt by you though to see if there's anything obviously incorrect?

  1. Exported taxonomy.qza to a tsv file.
  2. Used the code suggested by @mortonjt here to make the taxonomy readable by t2t. This seems to work, as in the output (gg2-t2t_consistency.tsv (115.0 KB)) all ASVs are specified down to s__, even though some levels are empty.
  3. Run t2t decorate: t2t decorate -m gg2-t2t_consistency.tsv -t rooted-tree.nwk -o gg2_t2t_out2

The decoration will run, but it will throw a bunch of OBSERVED/EXPECTED warnings, e.g: AFFECTED: 2 EXAMPLE: ASV730 OBSERVED: ['p__Patescibacteria', 'c__Saccharimonadia', 'o__UBA4664', 'f__UBA4664', 'g__UBA4664'] EXPECTED: ['d__Bacteria', 'p__Patescibacteria', 'c__Saccharimonadia', 'o__UBA4664', 'f__UBA4664', 'g__UBA4664']
The output tree (gg2_t2t_out.txt (29.1 KB)) and consensus strings (gg2_t2t_out-consensus-strings.txt (42.2 KB)) only specify d__bacteria and leave all other levels blank, though F-measures are produced for all taxa at all levels (gg2_t2t_out-fmeasures.txt (8.0 KB)).

Thanks for your time, and any help is greatly appreciated!
-Nathan

Oh right. That. When I was checking this, I hit that bug too and just realized I forgot to open an issue about it -- it's been a hectic couple of weeks. An issue is now open.

That consistency check was added in when we were working on Greengenes2 as a means truncate lineages that were incompatible with the input taxonomy. This can arise if the phylogeny and input taxonomy disagree substantially in parts of the tree, and is a safety to avoid producing something definitely incorrect. We designed the code to around two domain trees. In this case, the tree does not span two domains, so the domain label ends up placed on the root node. That violates an incorrect assumption (i.e., bug) the code currently makes (see here) where it assumes the root is unlabeled. As a result, the observed labels lack Bacteria, therefore creating a consistency issue, resulting in truncation, and a tree that doesn't have anything interesting when written.

If you're feeling adventurous, you could either hack the t2t code to remove the consistency check (ie comment this out), or issue a fix and send a PR ( :slight_smile: ).

Best,
Daniel

Hi @NathanStewart thank you so much for your patience, we are still working behind the scenes on a fix for this. I opened an issue on this GitHub here. Stay tuned and thank you again for reporting this!!

1 Like

@wasade Thank you! I commented out that section on a t2t install and it worked great, with a much more reasonable output. No guarantees I can issue a fix; my python coding is mediocre at best...

@cmartino No problem, and thank you for working on it!

3 Likes

Hi @NathanStewart,

Thank you for your patience, this bug has been fixed (see here). If you re-install (v.0.0.8 -> v.0.0.9) the bug is now fixed.

Thanks again for reporting this!

Cameron

3 Likes

Thanks so much @cmartino ! I tried v.0.0.9 and the output was much more reasonable.
-Nathan

1 Like