Thank you for a nice plugin and tutorial.
I was wondering if the visualization can be done in R as well? The emperor output is hard to export for modifications (the svg gets weirdly cut into pieces when imported to illustrator and the dots get merged together…). I imagine I can figure out how to make the PCoA ordination from the Aitchison distances, but not sure how to get the arrows for features on the plots. Have you done this and if yes, then could it be possible to have some guidance?
Thank you in advance.
Hello @rahel_park ,
pco <- read_qza("unweighted_unifrac_pcoa_results.qza")
could be read in as
RPCA <- read_qza("ordination.qza")
if you are following the tutorial above.
Thank you for guiding me to the qiime2R tutorial.
It doesn’t though look too straightforward to get to the PCoA plot including the feature arrows.
If anyone has already figured out a R script to get from the imported ordination qza file to a plot with the arrows, then could you please post your script here
Looks as though @jbisanz solved this problem in the thread linked below. I hope this helps.
Here is a minimal example but @cmartino can you please advise on the most appropriate way of selecting the taxa and for scaling their lengths? I am not 100% what I did is valid.
library(tidyverse) library(qiime2R) ord<-read_qza("~/Downloads/ordination.qza") meta<-read_tsv(file = "~/Downloads/metadata.txt") %>% rename(SampleID=`#SampleID`) %>% filter(SampleID!="#q2:types") tax<-read_qza("~/Downloads/taxonomy.qza")$data %>% rename(FeatureID=Feature.ID) #create the base plot with only the arrows baseplot<- ggplot() + theme_bw() + xlab(paste(round(100*ord$data$ProportionExplained,2),"%")) + ylab(paste(round(100*ord$data$ProportionExplained,2),"%")) + ggtitle("DEICODE biplot of Moving Pictures Data") + geom_segment(data=ord$data$Species %>% mutate(a=sqrt(PC1^2+PC2^2)) %>% # calculate the distance from the origin top_n(8, a) %>% #keep 8 furthest away points mutate(PC1=PC1*0.3, PC2=PC2*0.3) %>% # scale arrows linearly... is this ok? left_join(tax), aes(x=0, xend=PC1, y=0, yend=PC2, color=Taxon), arrow = arrow(length = unit(0.3,"cm")) ) # now overlay samples baseplot + geom_point( data=ord$data$Vectors %>% left_join(meta), aes(x=PC1, y=PC2, fill=BodySite), shape=21 )
The selection of the top arrows looks very similar, in the case of Emperor we use the magnitude based on all the dimensions which should be very close to the calculation you are making.
As for scaling, in Emperor the arrows and samples are scaled by the largest value in each matrix. The objective is to have dimensions that are normalized between the features and samples. Note that there are other strategies that might work for this problem.
From what I recall, in vegan the plot space is created with 2 pairs of x and y axes. One of the pairs represents the samples and the other pair represents the features. I believe doing this is not too hard in R but I am not familiar enough with ggplot.
Using two independent sets of axes is a bit dicey in ggplot2, perhaps the solution is to just remove the axis labels all together.
plot + theme(axis.text=element_blank(), axis.ticks=element_blank())
Is it weird that the plot I made with this code only has 2 arrows? Instead of 8?
Also, random question. In a separate file, and using my own code, I plotted PC1 and PC2 for my samples (not including the taxonomy information). This plot doesn’t show any separation of groups like it does using the code you provided. Do you possibly know why? I didn’t do any transformations on the PC1, PC2 information, just plotted it with colors specific to my sample types. I see that in your code there is a transformation but it only seems to be in relation to the taxonomy data. Any thoughts would be helpful!
Was this using your data or the moving pictures data? If your own data, have you checked your visualization against the q2 visualization? I would very carefully look at the values in data$Vectors and data$Species which define the X/Y plotting to see if they make sense versus what you are seeing in the biplot. If the Species data isn’t scaled it might compress all of your samples into the middle of the plot making it hard to see the groupings.
Hmmmm…this is a great point. I think you’re right. How do you choose which scale to use? I see what you did in the previous example but not sure if there was a specific decision-making process you used.
I picked arbitrarily, but I think as Yoshiki suggested above you could calculate a scaling factor for the X axis and Y axis by dividing the maximum PC1 and PC2 values for the Sample data by the Species Data.
@yoshiki, could you expand on scaling options?
Is there any other specific options that may be more useful than others? For example, I was thinking about sqrt(Eigvals$PC2/Eigvals$PC1) but don’t know if that’s appropriate. Can you advise?
Scaling for species and site scores. Either species (
2) or site (
1) scores are scaled by eigenvalues, and the other set of scores is left unscaled, or with
3both are scaled symmetrically by square root of eigenvalues. With negative scaling values in
rda, species scores are divided by standard deviation of each species and multiplied with an equalizing constant. Unscaled raw scores stored in the result can be accessed with
scaling = 0.
Thanks for updating r code for DEICODE. Great help to R users.
I’m confused about the way the top 8 features were selected.
mutate(a=sqrt(PC1^2+PC2^2)) %>% top_n(8, a)
In the DEICODE tutorial, it wrote
*The important features with regard to sample clusters are not a single arrow but by the log ratio between features represented by arrows pointing in different directions. *
Could you explain a little bit more on this?
Another problem I’m struggling with is the heatmap drawing as the paper by Cameron. In Fig5A, rownames are sample names; colnames are features; values are feature loadings. But how can we match the sample name to feature names with loading?\
I think those are questions best answered by @cmartino!
Qurro is a tool developed by @fedarko to help you do this form of analysis easily within QIIME2. Here is a tutorial that goes into depth on your question using a simple mock example dataset. Here is a more in-depth tutorial on choosing log-ratios of features with RPCA biplots & loadings on the moving pictures dataset. I am happy to answer any questions on those tutorials.
We do not have any built-in functions for this. I would lean heavily to using log-ratios and Qurro instead. However, the procedure is not too complicated if you would like to try it:
- Find the PC axis that
clustersthe samples by some metadata (in the paper one example is Healthy/Stressed sponges on PC1)
- Extract the sample and feature loadings for that axis. (this is the PC axis loading used in the scatter and arrow for samples and features respectively).
- Sort the CLR transformed
FeatureTable[Frequency](can use a pseudo count here) by those loadings from 2.
- Plot the sorted table as a heatmap.
I hope this helps. Let me know if you have more questions.
P.S. Thanks for tagging me @jbisanz.