Gneiss: connecting regression results with taxonomy

Is there any way to connect the gneiss regression results (eg., the p-values or residuals for each balance) with the taxonomy of each balance without having to use qiime gneiss balance-taxonomy on each balance one at a time? Using qiime gneiss balance-taxonomy is slow, given that the user must do the following:

  1. run qiime gneiss balance-taxonomy for each balance of interest
  2. view the visualization artifact
  3. download the taxonomy csv table (numerator and/or denominator)
  4. rename/move the csv file(s) (optional)
  5. combine all of the taxonomy csv files with the gneiss regression results

I was hoping to connect balances with feature (OTU) IDs by converting the FeatureTable[Balance] artifact: FeatureTable[Balance] --> biom --> tsv. However, this just generates a table of balances x sample. I’m looking for balance-ID --> OTU-ID (and OTU taxonomy).

Hi @nick-youngblut. Let me see if I understand your question. You want to have a way to automatically assign the taxonomies within a balance correct?

I personally haven’t done this through the command line - but this sort of analysis can definitely be done through the Gneiss Python API – but it’ll require a bit of scripting.

But from your question, it sounds like you are interested in getting phylogenetically relevant information, which hints at using a phylogenetic tree rather than the trees provided out of the box with Gneiss. There are some helper functions such as assign-ids that helps massages the tree into a suitable format (i.e. removing bifurcations and creating stable unique ids for the balances).
However, the internal nodes aren’t traditionally decorated with taxonomic names - you’d have to use another procedure such as tax2tree to do this.

That being said, there are other programs out there, namely philr and phylofactor that are more geared towards handling phylogenetic trees using balances and are worth checking out.

Thanks for the helpful suggestions! Just to be clear, I was just trying to get an automated list of OTU taxonomies for all OTUs in significant balances. For this task, I realized that I could just make a script that does the following:

For all significant balances (eg., y0, y1, y4, ...):
    Run `qiime gneiss balance-taxonomy` for the balance
    Export the resulting visualization artifact with `qiime tools export`
    Get the taxonomy tables from the `numerator.csv` and `denominator.csv` files

The point of this would be to get an overall assessment of which taxonomic groups are in significant balances (and in numerator or denominator). Just taking the OTU IDs from the regression summary tables, and querying the taxonomy table for the OTU IDs could also do the job. One more option would be to just map the balances on a 16S tree of all OTUs.

Hi @mortonjt! In this thread, you suggested using phylofactor (or philr). I’ve looked at both R packages (and their publications), and it seems like phylofactor is a great choice, except for one thing: it can only use general linear models (also it’s not available via CRAN, bioconductor, conda, etc). As far as I understand, general linear models assume independent data points, so spatial and/or temporal datasets will likely violate this assumption due to autocorrelation among samples that are close in time/space. What’s really confusing is the fact that the PeerJ paper on phylofactor and the phylofactor tutorial include a phylofactor analysis on a time series dataset: feces + tongue samples taken from only 2 individuals at many time points.

Am I fundamentally misunderstanding how phylofactor and/or general linear models work, or is the time series analysis in the phylofactor manuscript just ignoring any potential autocorrelation among time points, and thus violating the assumptions of the general linear model that was used?

Isn’t this why gneiss allows for linear mixed-effects models?

Any clarification would be greatly appreciated!

Hi Nick!

A few things on phylofactor:

  1. The function PhyloFactor can use glms, gams, and you can even customize your own objective function (e.g. see ? PhyloFactor and scroll down to the Hutchisonian niche finder). The function allows the user to interface with Y, the balances for each edge at each iteration of the algorithm, and produce an objective function which is maximized to choose the phylogenetic factor. For instance, if you know the assumed dependence of your samples, you can make a customized objective function based on generalized least squares which accounts for the covariance in residuals expected under such dependent samples.

  2. A new paper, and soon-to-be advertised new functionality gpf(), generalizes phylofactor to all exponential family random variables. Previously, phylogenetic factors were constructed from data assumed to be Gaussian or some monotonic transformation of a Gaussian (e.g. log, logit, etc.). You could use those assumed-Gaussian variables as either response or explanatory variables (e.g. you can use binomial regression on disease state, as we did with Crohn’s disease), but you couldn’t do proper, say, negative binomial regression on the count data. The new paper linked above discusses how to perform phylofactorization with any exponential family random variables and, by easy extension, discusses how this can be used for spatial and temporal data.

  3. You’re completely correct about the funkiness of the analysis done in the paper. That analysis was done for a mix of three reasons: (1) those were the only data I had on hand, (2) I was lazy as, for me, the math is sufficient proof of the method (a change of basis capturing phylogenetic effects we anticipate), and (3) the purpose of that analysis was to illustrate the method’s ability to identify real blocks in a matrix of real data. All models are wrong, some are useful - the data were definitely not independent and so the estimates were likely not maximum likelihood estimates except under a stepwise variable selection method in which the body site was the first variable used (accounting for the M/F hidden variable would be a good next step, and then possibly the time dependence). The math is correct - the algorithm for how to change basis for describing the likely phylogenetic effects remains unchanged - but the data, as always, are messy and assumptions which are satisfactory for the purposes of one researcher might be insufficient for another. It would be cool to see what happens when people use a more robust dataset of samples from different people (though there may be unaccounted-for dependence among samples there, too, such as systematic differences in diet, age, country, etc.). It’ll also be cool to see what happens when these analyses are replaced with reduced-rank regression analyses under count models as discussed in the paper linked above (though, again, we don’t know the distribution of sequence-count data, but the assumption of negative binomial/multinomial is a decent one, although inferences on the means of a single sequence’s rate parameter for a negative binomial is contentious due to the compositionality of the data).

If you have any issues tailoring phylofactorization to your needs, don’t be a stranger - drop me a line and we’ll get the package in shape to help you meet your research needs. That would be helpful for me, too, to ensure that there’s a user-friendly format prior to submission to CRAN.

Hope this helps! Happy science-ing!


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.