Biplot Vector Selection

Karrma · July 29, 2022, 7:14pm

I was creating some biplots in Qiime and noticed an odd "discrepancy" with the vectors being selected.
It says here that vector order is selected via importance and that "“Importance” is calculated for each feature based on the vector’s magnitude (euclidean distance from origin)".

My understanding from this is that the top 5 vectors are calculated as the most important and will visually appear as the largest vectors. However, if I pick a number of vectors instead of going with the default of 5 (e.g., --p-number-of-features 10), then I start to see larger vectors included that were not present in the original top 5 (pictured at the bottom; white vectors are present in the top 10 but not the top 5).

Is vector size not related to its importance? If not, then what determines vector size?
If vector size is related to importance - why doesn't Qiime output the top 5 largest vectors?

gregcaporaso · August 1, 2022, 6:50pm

Hi @Karrma, Interesting question. I had to poke around with this on my own to get an idea of why this would happen.

The magnitudes are related to the importances, but they are calculated across all dimensions/axes, not just the ones that you're viewing at any given time. The following code provides the sorted importances associated with all features (adapted from the biplot source code here and using data from the Moving Pictures tutorial). This is how qiime emperor biplot chooses which features to display.

In [2]: import qiime2

In [3]: a = qiime2.Artifact.load('./biplot.qza')

In [5]: from skbio import OrdinationResults

In [7]: biplot = a.view(OrdinationResults)

In [8]: feats = biplot.features.copy()

In [9]: import numpy as np

In [10]: origin = np.zeros_like(feats.columns)

In [11]: from scipy.spatial.distance import euclidean

In [12]: feats['importance'] = feats.apply(euclidean, axis=1, args=(origin,))

In [13]: feats.sort_values('importance', inplace=True, ascending=False)

In [17]: feats['importance'][:10]
Out[17]:
fe30ff0f71a38a39cf1717ec2be3a2fc    5.310779
997056ba80681bbbdd5d09aa591eadc0    5.053735
1d2e5f3444ca750c85302ceee2473331    4.063721
4b5eeb300368260019c1fbc7a3c718fc    3.416021
ab4ef4399912b0507d8d1187e874684d    3.359870
3c9c437f27aca05f8db167cd080ff1ec    2.559068
868528ca947bc57b69ffdf83e6b73bae    2.344982
eecc4a4317225eb579540e82ab785716    1.756212
9079bfebcce01d4b5c758067b1208c31    1.710010
e8e6b7fc969005938de8ac7ffb94f17c    1.633364
Name: importance, dtype: float64

The corresponding biplots are below. The features in purple are the ones highlighted when calling qiime emperor biplot with the default settings (show the top five most important features), and the features in yellow are the ones added when requesting the 10 most important features.

First, when looking at Axes 1 and 2 (Axis 3 is perpendicular to the plane that we're viewing here, so it is effectively hidden), notice that the purple features are those with the largest importance scores (see the data above). Also notice that feature 1d2e... has one of the largest magnitude vectors.

Now let's look at Axes 1 and 3. Notice that feature 1d2e... now has one of the shortest magnitudes (it's harder to see the label this time, unfortunately). This illustrates that the magnitude that we can see in any two dimensions doesn't reflect the magnitude across all of the dimensions used to compute the importance.
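To make the projection effect concrete, here is a minimal sketch with made-up coordinates (not values from the Moving Pictures data). The helper visible_length is hypothetical and only illustrates the idea: the drawn length of an arrow depends on which axes are in view, while the importance uses every axis.

```python
import numpy as np

# Made-up coordinates on Axis 1, Axis 2, Axis 3 (illustrative values only).
features = {
    "feature_a": np.array([0.5, 4.0, 0.2]),  # mostly along Axis 2
    "feature_b": np.array([2.0, 0.3, 2.0]),  # mostly along Axes 1 and 3
}


def visible_length(vec, axes):
    """Length of the arrow as drawn when only `axes` (0-based) are shown."""
    return float(np.linalg.norm(vec[list(axes)]))


for name, vec in features.items():
    full = float(np.linalg.norm(vec))       # magnitude over all axes
    on_1_2 = visible_length(vec, (0, 1))    # viewing Axis 1 vs Axis 2
    on_1_3 = visible_length(vec, (0, 2))    # viewing Axis 1 vs Axis 3
    print(f"{name}: full={full:.2f}  axes 1-2={on_1_2:.2f}  axes 1-3={on_1_3:.2f}")
```

In this toy case feature_a looks longest when viewing Axes 1 and 2 but shortest when viewing Axes 1 and 3, even though its magnitude over all axes never changes.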

One thing that you can try, to get the magnitudes that you see to line up better with the importances, is to compute PCoA for only two or three axes. You can do this by calling qiime diversity pcoa with the --p-number-of-dimensions parameter set to 2 or 3. Then compute your biplot from that PCoA matrix. If you give this a try, I'll be very interested to hear if that changes your final interpretation of the data - if so, this might be something we want to suggest to users.
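Before recomputing the PCoA, a rough check on an existing biplot is to rank the features using only the first two or three coordinate columns. The sketch below assumes a DataFrame shaped like biplot.features (one row per feature, one column per axis, with no extra 'importance' column). top_features is a hypothetical helper, not part of QIIME 2, and its ranking may not match what a PCoA recomputed with fewer dimensions would produce.

```python
import numpy as np
import pandas as pd


def top_features(feats, n=5, n_axes=None):
    """Return the n feature IDs with the largest magnitude.

    n_axes=None uses every column; n_axes=2 uses only the first two.
    """
    coords = feats if n_axes is None else feats.iloc[:, :n_axes]
    magnitude = np.sqrt((coords ** 2).sum(axis=1))
    return list(magnitude.sort_values(ascending=False).index[:n])


# Made-up coordinates standing in for biplot.features (illustrative only).
toy = pd.DataFrame(
    [[0.5, 4.0, 0.2], [2.0, 0.3, 2.0], [0.1, 0.1, 3.0], [1.0, 1.0, 0.0]],
    index=["f1", "f2", "f3", "f4"],
)

print(top_features(toy, n=2))            # ranked on all three axes
print(top_features(toy, n=2, n_axes=2))  # ranked on Axis 1 and Axis 2 only
```

If the two rankings differ, as they do for this toy table, the arrows shown in a two-axis view will not all be the longest ones visible.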
