DEICODE - Qurro: Feature loading sOTUs

helenaax2r · January 14, 2020, 4:14am

Hello @cmartino and @fedarko,
I am having so much fun using both DEICODE and Qurro, thank you for developing and making available such great tools!
I would like to build figures like Figure 5 in the DEICODE paper. Is there a way to do it using Qurro? (following a previous post)
Also, could you please help me understand how I could build Figures 5C and 5D (from the same paper) manually, using ordination.qza? I am having difficulties understanding the relationship between the rank plot, Axis (1, 2, and 3), and sample plot.
Thank you!

fedarko · January 15, 2020, 4:44am

Hi @helenaax2r,

Thanks for the kind words!

I would like to build figures like Figure 5 in the DEICODE paper . Is there a way to do it using Qurro? (following a previous post )

Fig. 5 was generated using code in a separate GitHub repository -- I think the code that actually generates these figures is this one, but I didn't write this code (or the DEICODE paper) so I can't say for sure how easy it'd be to adapt this to other datasets.

I'll get to how you can replicate parts of Fig. 5 in Qurro later on below.

Also, could you please help me understand how I could build Figures 5C and 5D (from the same paper) manually, using ordination.qza? I am having difficulties understanding the relationship between the rank plot, Axis (1, 2, and 3), and sample plot.

Sure! Let's take it from the top.

Axes

These refer to the axes (aka PCs) you see in a DEICODE biplot, or really in any sort of PCA visualization. In a biplot, each feature and sample has a loading for each axis. Qurro mostly cares about feature loadings -- what is shown in rank plots of DEICODE output are the feature loadings for a given axis.

To get a sense of what axes "mean", I highly recommend looking at a visualization of a DEICODE biplot (e.g. in Emperor) while looking at Qurro visualizations of DEICODE feature loadings. This'll show you how samples/features are positioned relative to these axes.

Rank plots

The top plots shown in Figs. 5C and 5D (the bar plots) describe the Axis 1 feature loadings of features in the DEICODE biplot, sorted in descending order.^1,2 These bar plots are also referred to as rank plots. Each³ "feature" (ASV, sOTU, etc.) in a dataset is represented as a bar in this plot -- the Axis 1 loading for each feature is shown on the y-axis.

These feature loading values are really just numbers -- you can create these rank plots from them, like Qurro does and like fig. 5 does, to provide a quick indication of how features in a dataset correspond to some sort of variation. @cmartino would be able to explain the math details better than me, but long story short you can think of the feature loading values as the indications of how features "contribute" to variation along a given axis in the biplot. (For details about interpreting these loadings / interpreting compositional biplots in general, I'd recommend checking out section 4 of Aitchison and Greenacre (2002).)

One more note about rank plots I should mention -- although figs. 5C and 5D's rank plots are only of the Axis 1 feature loadings for those biplots, you could totally switch these to Axis 2, Axis 3, etc. to show features' loadings for those axes instead. This is doable in Qurro using the "Feature Loading" selection box below the rank plot. (Trying this could be useful if, for example, a sample group you're interested in was separated along Axis 2 instead of Axis 1 in the biplot -- see Fig. 2 of this preprint for an example in practice.)

Sample plots

This is a term I made up for Qurro (sorry to introduce more words into the literature...). Basically, all this is is a scatterplot/boxplot of samples in a dataset: one axis is the log-ratio of features for each sample, and the other axis is some other variable for each⁴ sample. The reason this is shown alongside the rank plot is that, hopefully, the rank plot should serve as a guide for what features to try looking at log-ratios of -- and the sample plot should indicate how these log-ratios vary across different samples.

In Qurro's sample plots the y-axis corresponds to some selected log-ratio, and the x-axis can be set to any sample metadata field of interest. Fig. 5C/5D's plots are a bit different from how Qurro organizes sample plots -- here, the x-axis (not the y-axis!) indicates the log-ratio values, and the y-axis are just the sample loadings for Axis 1 of the DEICODE biplot.

Replicating parts of Fig. 5 in Qurro

Replicating the rank plots

You should be able to create figures like the rank plots in Figs. 5C and 5D using Qurro, but of course they will look slightly different (e.g. features will be ranked in ascending instead of descending order by their loadings; colors will be different). When you select features for a log-ratio in Qurro they'll be highlighted on the rank plot, analogous to how Synechococcus, Cereibacter, etc. are highlighted in Figs. 5C/5D (although you'd need to add in the fancy arrow / text box saying Synechococcus (g) yourself, at least for now).

I should note that support for selecting multiple "groups" of features at once like as shown in the 5C/5D rank plots is still kind of messy in Qurro -- you can try something like filtering on features that contain the separated text fragments and then putting down something like Synechococcus, Cereibacter to replicate the blue features highlighted in the left rank plot in fig. 5C. However, this will include all features containing these text fragments, not just the highest or lowest ranked features with this text.

Replicating the "sample plots"

For creating figures like the plots shown below the rank plots (indicating correlation between PC1 of the sample loadings and certain log-ratios), this isn't currently doable in Qurro since Qurro doesn't do anything with biplots' sample loadings yet. I have an open issue to add support for generating these kind of figures in the sample plot eventually, but due to other obligations I probably won't be able to get around to that for some time. In the meantime you may be able to use the code in the aforementioned DEICODE repo to do this.

Hope this all helps make things clearer! Please let us know if you have any other questions, and thanks for trying out these tools.

Footnotes

¹ It's probably worth noting that earlier versions of DEICODE had some bugs relating to which axis was labelled which -- see here for details. This shouldn't impact interpretation/results much, but I would suggest making sure you're using the latest version of DEICODE.

² In this paper and in Qurro, we sort these plots instead in ascending order -- but this choice doesn't make much of a difference aside from flipping the plot horizontally. You're still "ranking" the features

³ In practice, some features might get filtered out for some reason.

⁴ In practice, some samples often end up being filtered out of these kinds of plots, due to e.g. having a count of 0 for one side of a log-ratio (log of 0 is undefined, as is log(x/0)). Qurro tries to be explicit about this happening.

helenaax2r · January 16, 2020, 2:19pm

Hello @fedarko,
Thank you so much for such a detailed response!

Thank you for pointing this out, I had an 'aha!' moment with my analysis!

I see I can export the 'log-ratios' of taxa that I am comparing per sample using the option "Export current sample plot data" in the sample plot panel. Could you please indicate how I can obtain the corresponding PC1 value?
My last question is regarding the arrows in the biplot. I used an ordination.qza created from qiime deicode rpca. I chose --p-number-of-features 10. I noticed that 9 out of the 10 most important features shown in the biplot, are also the features with the highest frequencies in table.qza. Is it safe to say that those highly abundant features also happen to be the most important features?

I am so grateful for all your help!
Best,
Helena

fedarko · January 16, 2020, 8:40pm

Great, glad this is helpful!

I see I can export the ‘log-ratios’ of taxa that I am comparing per sample using the option “Export current sample plot data” in the sample plot panel. Could you please indicate how I can obtain the corresponding PC1 value?

The easiest way I can think of doing this is using QIIME 2's python "Artifact API" to read the sample loadings from the DEICODE ordination. Here's an example for the sleep apnea dataset used in Qurro right now --

>>> from qiime2 import Artifact
>>> from skbio import OrdinationResults
>>> ordination = Artifact.load("ordination.qza").view(OrdinationResults)
>>> sample_loadings = ordination.samples
>>> print(sample_loadings)
                      0         1         2
10422.21.F.3  -0.050270  0.051419 -0.039862
10422.28.F.9  -0.004828 -0.085672  0.121970
10422.24.F.10 -0.024093  0.008705 -0.083470
10422.21.F.10 -0.045366  0.009991 -0.054587
10422.27.F.8  -0.011025 -0.087463  0.069552
...                 ...       ...       ...
10422.17.F.11 -0.030206  0.007674  0.048584
10422.31.F.4  -0.028123  0.045799 -0.012168
10422.31.F.13  0.007370 -0.105690 -0.068229
10422.18.F.11 -0.034735  0.041962  0.033664
10422.28.F.4  -0.028692  0.031643  0.048375

[184 rows x 3 columns]

sample_loadings is a pandas DataFrame describing the loadings for each sample -- each row is a sample, and each column is an Axis/PC/etc. You could load the exported sample plot data (i.e. the selected log-ratios for each sample, and also some other metadata) as a pandas DataFrame (for an example of loading this file in pandas see here) and then merge that with sample_loadings to get one big table. From there you should be able to use pandas/matplotlib/etc. to plot things as you like.

My last question is regarding the arrows in the biplot. I used an ordination.qza created from qiime deicode rpca. I chose --p-number-of-features 10. I noticed that 9 out of the 10 most important features shown in the biplot, are also the features with the highest frequencies in table.qza. Is it safe to say that those highly abundant features also happen to be the most important features?

It might be the case for your dataset, but I don't think you can say this in general I'm pretty sure it's possible for the most abundant features across all samples to not be the most "important" in the feature loadings -- e.g. in a case where one of your features is highly abundant just because it's present in all of the samples at roughly the same relative abundance, I don't think this feature would be considered to be "important" in the feature loadings. @cmartino might have other thoughts on this...

cmartino · January 16, 2020, 9:35pm

Often the differential microbes between sample groups are not the most abundant or frequently occurring. In many practical cases, you may want to see what are the abundant & differential microbes. This will ensure you do not have a high sample dropout when taking log-ratios. To do this you can apply a feature frequency filter or sum filter (found here) to your FeatureTable[Frequency] table before running DEICODE or Songbird. But luckily you do have that problem with this dataset .

helenaax2r · January 17, 2020, 3:40am

Thank you, @cmartino!
I got a little confused with this:

In my dataset, 41/42 samples are valid when I check the results using Qurro (i.e., only 1 sample has an invalid log-ratio). Do you think I do have a problem with this dataset? I was hoping you meant that I "... do not have that problem with this dataset", but just wanted to make sure.

Thank you!

helenaax2r · January 17, 2020, 3:42am

This is great! Thank you so much, @fedarko!

cmartino · January 17, 2020, 6:31pm

I apologize for the confusion I misspoke, you do not have that problem.