I was creating some biplots in Qiime and noticed an odd "discrepancy" with the vectors being selected.
It says here that vector order is determined by importance, and that "'Importance' is calculated for each feature based on the vector's magnitude (euclidean distance from origin)".
My understanding from this is that the top 5 vectors are the ones calculated to be most important, and that visually they will appear as the largest vectors. However, if I pick a number of vectors instead of going with the default of 5 (ex: --p-number-of-features 10), then I start to see larger vectors included that were not present in the original top 5 (pictured at the bottom; white vectors are those present in the top 10 but not the top 5).
- Is vector size not related to its importance? If not, then what determines vector size?
- If vector size is related to importance - why doesn't Qiime output the top 5 largest vectors?
Hi @Karrma, Interesting question. I had to poke around with this on my own to get an idea of why this would happen.
The magnitudes are related to the importances, but the importances are calculated across all dimensions/axes, not just the ones that you're viewing at any given time. The following code provides the sorted importances associated with all features (adapted from the biplot source code here and using data from the Moving Pictures tutorial). This is how qiime emperor biplot chooses which features to display.
import numpy as np
import qiime2
from scipy.spatial.distance import euclidean
from skbio import OrdinationResults

# Load the biplot artifact and view it as a scikit-bio OrdinationResults object
a = qiime2.Artifact.load('./biplot.qza')
biplot = a.view(OrdinationResults)

# Feature coordinates: one row per feature, one column per ordination axis
feats = biplot.features.copy()
origin = np.zeros_like(feats.columns)

# Importance = Euclidean distance from the origin across all axes
feats['importance'] = feats.apply(euclidean, axis=1, args=(origin,))
feats.sort_values('importance', inplace=True, ascending=False)

# The ten most important features
feats['importance'][:10]
Name: importance, dtype: float64
The corresponding biplots are below. The features in purple are the ones highlighted when calling qiime emperor biplot with the default settings (show the top five most important features), and the features in yellow are the ones added when requesting the 10 most important features.
First, when looking at Axes 1 and 2 (Axis 3 is perpendicular to the plane that we're viewing here, so it is effectively hidden), notice that the purple features are those with the largest importance scores (see the data above). Also notice that feature 1d2e... has one of the largest magnitude vectors.
Now let's look at Axes 1 and 3. Notice that feature 1d2e... now has one of the shortest magnitudes (it's harder to see the label this time, unfortunately). This illustrates that the magnitude we can see in any two dimensions doesn't reflect the magnitude across all of the dimensions used to compute the importance.
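To make that concrete, here is a minimal sketch with made-up numbers (the coordinates and the names feat_a, feat_b, feat_c are hypothetical, not taken from the Moving Pictures data). It compares each feature's length across all axes with the length you would see when only two axes are displayed.

```python
import numpy as np

# Hypothetical feature coordinates (rows: features; columns: Axis 1, 2, 3).
coords = np.array([
    [0.1, 0.9, 0.1],   # feat_a: mostly along Axis 2
    [0.2, 0.1, 1.2],   # feat_b: mostly along Axis 3
    [0.6, 0.5, 0.0],   # feat_c: lies in the Axis 1-2 plane
])
names = ["feat_a", "feat_b", "feat_c"]


def magnitude(coords, axes=None):
    """Euclidean length of each row, optionally using only the given axes."""
    sub = coords if axes is None else coords[:, list(axes)]
    return np.linalg.norm(sub, axis=1)


importance = magnitude(coords)            # all axes, like the importance score
seen_12 = magnitude(coords, axes=(0, 1))  # what is visible on Axis 1 vs Axis 2
seen_13 = magnitude(coords, axes=(0, 2))  # what is visible on Axis 1 vs Axis 3

for name, full, v12, v13 in zip(names, importance, seen_12, seen_13):
    print(f"{name}: importance={full:.2f}  axes1-2={v12:.2f}  axes1-3={v13:.2f}")
```

In this toy example feat_b has the highest importance but looks like the shortest arrow on Axes 1 and 2, while feat_a looks long on Axes 1 and 2 and short on Axes 1 and 3, which is the same pattern as feature 1d2e... in the plots.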
One thing that you can try, to get the magnitudes that you see to line up better with the importances, is to compute PCoA for only two or three axes. You can do this by calling qiime diversity pcoa with the --p-number-of-dimensions parameter set to 2 or 3. Then compute your biplot from that PCoA matrix. If you give this a try, I'll be very interested to hear if that changes your final interpretation of the data - if so, this might be something we want to suggest to users.
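If you want a rough preview of how much the selection might change before re-running anything, the sketch below ranks features using only the first few axes. This is an assumption-laden stand-in: simply dropping columns is not guaranteed to match what re-running PCoA with fewer dimensions produces, the helper top_n is hypothetical, and the coordinates are invented for illustration.

```python
import numpy as np


def top_n(coords, n, n_axes=None):
    """Indices of the n longest feature vectors.

    Uses only the first n_axes columns when n_axes is given, else all columns.
    """
    sub = coords if n_axes is None else coords[:, :n_axes]
    lengths = np.linalg.norm(sub, axis=1)
    # argsort is ascending, so reverse it and keep the first n entries
    return np.argsort(lengths)[::-1][:n]


# Hypothetical feature coordinates (rows: features; columns: Axis 1..4).
coords = np.array([
    [0.9, 0.1, 0.0, 0.0],   # long on the first two axes only
    [0.1, 0.2, 0.9, 0.8],   # long mostly on the later axes
    [0.5, 0.6, 0.1, 0.0],   # long on the first two axes only
    [0.2, 0.2, 0.7, 0.6],   # long mostly on the later axes
])

print("top 2 using all axes:      ", top_n(coords, 2))
print("top 2 using first two axes:", top_n(coords, 2, n_axes=2))
```

With these numbers the two selections do not overlap at all, which is the extreme version of what the plots above show. On real data the difference will usually be smaller, so it is worth checking against an actual biplot built from the reduced PCoA.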