Questions about interpreting DEICODE and Qurro output

arwqiime · May 8, 2020, 11:31am

Hi @fedarko and @cmartino, you and your colleagues did a great job with DEICODE and Qurro!
To be sure, I would like to ask you about the explanation of the 'important features'. You wrote that "the important features with regard to sample clusters are not a single arrow but by the log ratio between features represented by arrows pointing in different directions".
What exactly do you mean by "between features represented by arrows"? I produced a Qurro plot from your data, and if I select g__Streptococcus as the numerator and again in the denominator (but there with 'is provided, and does not contain'), I can see log-ratio differences in the box plot for BodySite (gut log-ratio ca. -7 and palm/tongue ca. -2). And I get similar differences with g__Bacteroides alone (log-ratio in the other direction: ca. 0 for gut vs. -5 for palm/tongue). So, I used these features individually and did not compare 'between'.

My understanding was that the arrows in the biplot indicate the features contributing most to the RPCA biplot, but that they can be taken individually for more detailed investigation.
In Qurro, these features can be selected as a starting point for a closer look, but the autoselection option is also a good place to start searching for contributing taxa if the biplot showed a separation of sample groups along this axis.

Maybe this is only a misunderstanding of the wording, but I would appreciate your comment.
Best regards

fedarko · May 11, 2020, 6:34pm

Hi @arwqiime,

Thanks for the kind comment!

"Features" here just means whatever the observations are in the table you pass in to DEICODE. For the moving pictures dataset these are ASVs computed by DADA2; depending on how your table was generated and what sort of data you're working with these could instead be OTUs, genes, etc.

For 16S rRNA sequencing, it's worth noting that multiple features can have the same taxonomy annotation. This is normal, since 16S usually can't give you resolution more specific than the genus level. Within a single genus there's room for a lot of functional variation, which is one explanation for why you might see similarly-annotated ASVs across different types of samples. (This, in turn, may be part of the reason why a lot of times you see the same type of feature on completely different sides of the rank plot -- notice how there are lots of Streptococcus distributed throughout both the left and right sides of the Axis 1 feature loadings in the DEICODE biplot shown at the bottom of this tutorial.)

So, this approach works, but I don't think this is the best way of using Qurro. What question are you asking? This tells you that the log-ratio of Streptococcus to all other features in the rank plot is higher in palm/tongue samples than in gut samples, but it doesn't really say much beyond that. It may also be hard to reproduce this exact result on other datasets -- see below for more details.

The way I see it, autoselection is useful as a way to check that there is some separation relative to a given metadata field from the rankings (e.g. with the DEICODE biplot, gut samples separate from the other body sites along Axis 1, so doing autoselection for features along that axis will probably separate gut samples as well); if this is confirmed, I usually try to dig in to what specific features or types of features are contributing to that separation. The arrows in the biplot can be a good place to start for this, since the few that Emperor shows you are the ones with the highest magnitude.
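
As a rough illustration of the autoselection idea (a sketch, not Qurro's actual code): take the n highest- and n lowest-ranked features along one axis as numerator and denominator, then compute each sample's log-ratio of summed counts. The feature names, rank values, counts, and the `autoselect` and `log_ratio` helpers below are all made up for this example.

```python
import math

# Hypothetical Axis 1 feature rankings (loadings); values are made up.
rankings = {"ASV1": -2.1, "ASV2": -1.4, "ASV3": -0.2,
            "ASV4": 0.3, "ASV5": 1.6, "ASV6": 2.4}

# Hypothetical counts per sample, in the same feature order as `rankings`.
features = list(rankings)
counts = {
    "gut_1":  [90, 60, 10, 8, 2, 1],
    "gut_2":  [70, 80, 12, 9, 3, 2],
    "palm_1": [3, 2, 11, 10, 75, 95],
    "palm_2": [1, 4, 9, 12, 60, 80],
}

def autoselect(rankings, n):
    """Return (numerator, denominator): the n highest- and n lowest-ranked features."""
    ordered = sorted(rankings, key=rankings.get)
    return ordered[-n:], ordered[:n]

def log_ratio(sample_counts, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts.

    Returns None when either sum is zero, since the log-ratio is undefined there.
    """
    by_feature = dict(zip(features, sample_counts))
    top = sum(by_feature[f] for f in numerator)
    bottom = sum(by_feature[f] for f in denominator)
    if top == 0 or bottom == 0:
        return None
    return math.log(top / bottom)

numerator, denominator = autoselect(rankings, n=2)
sample_log_ratios = {s: log_ratio(c, numerator, denominator) for s, c in counts.items()}
for sample, value in sample_log_ratios.items():
    print(sample, round(value, 2))
```

In this toy table the gut samples get negative log-ratios and the palm samples positive ones, which is the kind of separation autoselection is meant to confirm before digging into specific features.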

Similarly, the "is provided, and does not contain" searching (as you've described using it above) could be useful for saying that "this feature looks like it might be differentially abundant in these sample(s)", but I don't know if it's really useful for any conclusions beyond that.

Selecting a more targeted log-ratio (where the denominator is a well-defined set of features / genera / etc.) is nice because it's easier to reproduce in a different dataset: it's easier to reproduce "the log-ratio of Staphylococcus aureus to Propionibacterium acnes was higher in lesioned than non-lesioned samples" than it is to reproduce "the log-ratio of Staphylococcus aureus to literally every other feature in the rank plot was higher in lesioned than non-lesioned samples", because in the second case you have to worry about measuring all of the features in the denominator. You can imagine that for situations like comparing 16S vs. shotgun metagenomic sequencing datasets, or even multiple 16S datasets using different PCR primers, etc., the "background noise" of other features in the rank plot can complicate things here a lot. (Edit: If all of the studies in question have a lot of features, then this problem might not be too bad -- there's a paragraph in the discussion section of this recent paper [it's the part starting at "Using the terminology"] which describes this problem a bit.)

(Edit: another reason to select more well-defined denominators is that you can have the denominator be consistent across multiple log-ratios, whereas when you're just keeping the denominator as "every feature that isn't the numerator" the denominator will also change when you change the numerator, if that makes sense)
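
A small sketch of the denominator point, assuming log-ratios are computed as the natural log of summed numerator counts over summed denominator counts. The counts and the `log_ratio` and `everything_else` helpers are hypothetical and only meant to show how an "everything else" denominator shifts with the numerator while a fixed reference does not.

```python
import math

# Hypothetical genus-level counts for one sample; the numbers are made up.
sample = {"Streptococcus": 40, "Bacteroides": 400,
          "Staphylococcus": 20, "Prevotella": 140}

def log_ratio(counts, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts."""
    top = sum(counts[f] for f in numerator)
    bottom = sum(counts[f] for f in denominator)
    return math.log(top / bottom)

def everything_else(counts, numerator):
    """Denominator for an 'every feature that isn't the numerator' selection."""
    return [f for f in counts if f not in numerator]

# "Everything else" denominators: the feature set shifts with the numerator.
denom_strep = everything_else(sample, ["Streptococcus"])
denom_staph = everything_else(sample, ["Staphylococcus"])

# Targeted denominator: one fixed reference shared by both log-ratios.
reference = ["Bacteroides"]

strep_vs_rest = log_ratio(sample, ["Streptococcus"], denom_strep)
staph_vs_rest = log_ratio(sample, ["Staphylococcus"], denom_staph)
strep_vs_ref = log_ratio(sample, ["Streptococcus"], reference)
staph_vs_ref = log_ratio(sample, ["Staphylococcus"], reference)

# With a shared reference, the difference between the two log-ratios is
# exactly log(Streptococcus / Staphylococcus); with "everything else" it is not.
direct = math.log(sample["Streptococcus"] / sample["Staphylococcus"])
print(denom_strep, denom_staph)
print(strep_vs_ref - staph_vs_ref, strep_vs_rest - staph_vs_rest, direct)
```

With the shared reference, the two log-ratios can be compared directly because the denominator cancels; with the "everything else" denominators, each log-ratio is measured against a different background.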

I didn't write that sentence so I can't say for sure, but I think the mention of "arrows pointing in different directions" is based on Aitchison and Greenacre (2002), which describes how to interpret compositional biplots. I really recommend reading through the paper -- some of the math can be a bit intimidating (looking at you, section 2 ._.), but section 4 in particular does a nice job explaining various ways to interpret these biplots.

So, let's check out Fig. 3 in that paper. Here, "features" are colors and "samples" are abstract paintings, but the same principles as with ASVs and microbiome samples should still generally apply.

I think properties 4 and 7 might be useful here:

4.4. Property 4

Angle cosines between links in the covariance biplot estimate correlations between log‐ratios. Thus the fact that the links between blue, yellow and red lie perpendicularly to the links between white, other and black indicates that log‐ratios among the first set have near zero correlations with those among the second set.

4.7. Property 7

If a subset I of the individuals (rows) and a subset J of the components (columns) [note: here "rows" mean samples and "columns" mean features] lie approximately on respective straight lines that are orthogonal, then the compositional submatrix formed by the rows I and columns J has approximately constant log‐ratios among the components, i.e. the double‐centred submatrix of log(compositions) has near‐zero entries. For example in both biplots it is possible to see a group of three row points in the lower left quadrant (rows 9, 21 and 15) which are in a straight line that is orthogonal to the line defined by the three column points white–other–black.

These paragraphs are really dense, but the TLDR of these as I interpret them (and based on personal experience working with DEICODE / compositional biplots) is that if you select the log-ratio of two features pointing in roughly opposite directions, you usually get a log-ratio that separates samples along that axis of the biplot -- and this can be useful if you care about that separation.
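
One way to sketch "two features pointing in roughly opposite directions" is to look for the pair of arrows with the most negative cosine between them. The arrow coordinates below are made up, and `cosine` and `most_opposed_pair` are hypothetical helpers, not part of DEICODE, Emperor, or Qurro.

```python
import math
from itertools import combinations

# Hypothetical feature arrows as (Axis 1, Axis 2) biplot coordinates; values are made up.
arrows = {
    "ASV_A": (2.0, 0.3),
    "ASV_B": (-1.8, -0.2),
    "ASV_C": (0.1, 1.5),
    "ASV_D": (1.2, 1.1),
}

def cosine(u, v):
    """Cosine of the angle between two arrows drawn from the biplot origin."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def most_opposed_pair(arrows):
    """Return the pair of features whose arrows have the most negative cosine."""
    return min(
        combinations(arrows, 2),
        key=lambda pair: cosine(arrows[pair[0]], arrows[pair[1]]),
    )

pair = most_opposed_pair(arrows)
pair_cosine = cosine(arrows[pair[0]], arrows[pair[1]])
print(pair, round(pair_cosine, 3))
```

This only picks candidate features; whether their log-ratio actually separates the samples still has to be checked by selecting them as numerator and denominator and looking at the sample plot.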

(Caveat 1: In practice things like sparsity are going to complicate this -- taking log-ratios between individual features can often result in lots of sample dropout due to zeroes, which is part of why Qurro's moving pictures tutorial has a section describing moving "up" taxonomic ranks to avoid this.)
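
To illustrate the dropout problem, here is a minimal sketch assuming that a sample is dropped whenever the numerator or denominator sums to zero (the log-ratio is undefined there). The "genus|ASV" labels, the counts, and both helper functions are made up for this example.

```python
import math

# Hypothetical ASV counts per sample; "genus|ASV" labels and numbers are made up.
counts = {
    "s1": {"Streptococcus|a1": 5, "Streptococcus|a2": 0,
           "Bacteroides|b1": 0, "Bacteroides|b2": 30},
    "s2": {"Streptococcus|a1": 0, "Streptococcus|a2": 8,
           "Bacteroides|b1": 12, "Bacteroides|b2": 0},
    "s3": {"Streptococcus|a1": 3, "Streptococcus|a2": 1,
           "Bacteroides|b1": 7, "Bacteroides|b2": 9},
    "s4": {"Streptococcus|a1": 0, "Streptococcus|a2": 0,
           "Bacteroides|b1": 4, "Bacteroides|b2": 6},
}

def log_ratios(counts, numerator, denominator):
    """Per-sample log-ratio of summed counts; samples with a zero sum are dropped."""
    result = {}
    for sample, row in counts.items():
        top = sum(row[f] for f in numerator)
        bottom = sum(row[f] for f in denominator)
        if top > 0 and bottom > 0:
            result[sample] = math.log(top / bottom)
    return result

def features_in_genus(counts, genus):
    """All feature IDs whose label starts with the given genus name."""
    first_row = next(iter(counts.values()))
    return [f for f in first_row if f.split("|")[0] == genus]

# Single-ASV log-ratio: most samples have a zero on one side and drop out.
asv_level = log_ratios(counts, ["Streptococcus|a1"], ["Bacteroides|b1"])

# Genus-level log-ratio: summing ASVs within each genus leaves fewer zeros.
genus_level = log_ratios(
    counts,
    features_in_genus(counts, "Streptococcus"),
    features_in_genus(counts, "Bacteroides"),
)

print(sorted(asv_level), sorted(genus_level))
```

In this toy table the single-ASV log-ratio keeps one sample out of four, while the genus-level log-ratio keeps three; a sample with no Streptococcus at all still drops out either way.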

(Caveat 2: If the biplot's top PCs don't explain a lot of variation then I think there will be additional problems with this. This problem is touched on a bit in this PDF from Greg Gloor -- section 4.1 should be useful, and seems less intimidating than the Aitchison/Greenacre paper mentioned above.)

Hope this helps! Sorry this kind of turned into a huge mess of a post. I'm gonna tag @cmartino to see if he has anything to add (and to check that I'm not misleading you...) Thanks for using these tools!

arwqiime · May 12, 2020, 7:53am

Hi @fedarko, thanks a lot for your meaningful comments on looking at the data in several ways and on the limitations of some 'first views'.
Best regards!