How to make pcoa biplot in R using q2-deicode ordination


(Rahel Park) #1

Hi,
Thank you for a nice plugin and tutorial.
I was wondering if the visualization can be done in R as well? The emperor output is hard to export for modifications (the svg gets weirdly cut into pieces when imported to illustrator and the dots get merged together…). I imagine I can figure out how to make the PCoA ordination from the Aitchison distances, but not sure how to get the arrows for features on the plots. Have you done this and if yes, then could it be possible to have some guidance?
Thank you in advance.


Robust Aitchison PCA Beta Diversity with DEICODE
(cameron martino) #2

Hello @rahel_park ,

There is a way to move the ordination.qza output into R for biplot visualization. I would do this using the qiime2R tutorial provided by @jbisanz. Specifically the command

pco <- read_qza("unweighted_unifrac_pcoa_results.qza")

could be read in as

RPCA <- read_qza("ordination.qza")

if you are following the tutorial above.


(Rahel Park) #3

Thank you for guiding me to the qiime2R tutorial.
It doesn’t though look too straightforward to get to the PCoA plot including the feature arrows.
If anyone has already figured out a R script to get from the imported ordination qza file to a plot with the arrows, then could you please post your script here :wink:
Thank you!


(cameron martino) #4

Hello @rahel_park,

Looks as though @jbisanz solved this problem in the thread linked below. I hope this helps.


(Jordan Bisanz) #5

Here is a minimal example but @cmartino can you please advise on the most appropriate way of selecting the taxa and for scaling their lengths? I am not 100% what I did is valid.

library(tidyverse)
library(qiime2R)

ord<-read_qza("~/Downloads/ordination.qza")
meta<-read_tsv(file = "~/Downloads/metadata.txt") %>% 
      rename(SampleID=`#SampleID`) %>%
      filter(SampleID!="#q2:types")
tax<-read_qza("~/Downloads/taxonomy.qza")$data %>%
  rename(FeatureID=Feature.ID)


#create the base plot with only the arrows
baseplot<-
  ggplot() +
  theme_bw() +
  xlab(paste(round(100*ord$data$ProportionExplained[1],2),"%")) +
  ylab(paste(round(100*ord$data$ProportionExplained[2],2),"%")) +
  ggtitle("DEICODE biplot of Moving Pictures Data") +
  geom_segment(data=ord$data$Species %>% 
                 mutate(a=sqrt(PC1^2+PC2^2)) %>% # calculate the distance from the origin
                 top_n(8, a) %>% #keep 8 furthest away points
                 mutate(PC1=PC1*0.3, PC2=PC2*0.3) %>% # scale arrows linearly... is this ok? 
                 left_join(tax),
               aes(x=0, xend=PC1, y=0, yend=PC2, color=Taxon),
               arrow = arrow(length = unit(0.3,"cm"))
              )

# now overlay samples
baseplot +
  geom_point(
    data=ord$data$Vectors %>%
      left_join(meta),
      aes(x=PC1, y=PC2, fill=BodySite), 
      shape=21
      )


(Yoshiki Vázquez Baeza) #6

The selection of the top arrows looks very similar, in the case of Emperor we use the magnitude based on all the dimensions which should be very close to the calculation you are making.

As for scaling, in Emperor the arrows and samples are scaled by the largest value in each matrix. The objective is to have dimensions that are normalized between the features and samples. Note that there are other strategies that might work for this problem.

From what I recall, in vegan the plot space is created with 2 pairs of x and y axes. One of the pairs represents the samples and the other pair represents the features. I believe doing this is not too hard in R but I am not familiar enough with ggplot.


(Jordan Bisanz) #7

Using two independent sets of axes is a bit dicey in ggplot2, perhaps the solution is to just remove the axis labels all together.

plot + theme(axis.text=element_blank(), axis.ticks=element_blank())

(Sarah) #8

Is it weird that the plot I made with this code only has 2 arrows? Instead of 8?

Also, random question. In a separate file, and using my own code, I plotted PC1 and PC2 for my samples (not including the taxonomy information). This plot doesn’t show any separation of groups like it does using the code you provided. Do you possibly know why? I didn’t do any transformations on the PC1, PC2 information, just plotted it with colors specific to my sample types. I see that in your code there is a transformation but it only seems to be in relation to the taxonomy data. Any thoughts would be helpful!


(Jordan Bisanz) #9

Was this using your data or the moving pictures data? If your own data, have you checked your visualization against the q2 visualization? I would very carefully look at the values in data$Vectors and data$Species which define the X/Y plotting to see if they make sense versus what you are seeing in the biplot. If the Species data isn’t scaled it might compress all of your samples into the middle of the plot making it hard to see the groupings.


(Sarah) #10

Hmmmm…this is a great point. I think you’re right. How do you choose which scale to use? I see what you did in the previous example but not sure if there was a specific decision-making process you used.


(Jordan Bisanz) #11

I picked arbitrarily, but I think as Yoshiki suggested above you could calculate a scaling factor for the X axis and Y axis by dividing the maximum PC1 and PC2 values for the Sample data by the Species Data.


(Sarah) #12

@yoshiki, could you expand on scaling options?

Is there any other specific options that may be more useful than others? For example, I was thinking about sqrt(Eigvals$PC2/Eigvals$PC1) but don’t know if that’s appropriate. Can you advise?


(Yoshiki Vázquez Baeza) #13

Hi @smreyes, I don’t know what would make a scaling more appropriate than other. If you look at vegan’s help for the biplot.rda function, they list a few options:

Scaling for species and site scores. Either species ( 2 ) or site ( 1 ) scores are scaled by eigenvalues, and the other set of scores is left unscaled, or with 3 both are scaled symmetrically by square root of eigenvalues. With negative scaling values in rda , species scores are divided by standard deviation of each species and multiplied with an equalizing constant. Unscaled raw scores stored in the result can be accessed with scaling = 0 .