Qiime2R PCoA plotting and understanding explained variance

TukelSB · February 10, 2022, 6:08pm

Hello!

I've been using Qiime2R to plot PCoAfor my data. This is the code I have so far:

metadata<-read_q2metadata("Metadata.tsv")
wunifrac<-read_qza("weighted_unifrac_pcoa_results.qza")
shannon<-read_qza("shannon_vector.qza")$data %>% rownames_to_column("SampleID")

wunifrac$data$Vectors %>%
select(SampleID, PC1, PC2) %>%
left_join(metadata) %>%
left_join(shannon) %>%
ggplot(aes(x=PC1, y=PC2, color=Time-Point, size=shannon_entropy)) +
geom_point(alpha=0.5) + alpha controls transparency and helps when points are overlapping
xlab(paste(round(100wunifrac$data$ProportionExplained[1],2),"%")) +
ylab(paste(round(100wunifrac$data$ProportionExplained[2],2),"%")) +
theme_q2r() +
scale_size_continuous(name="Shannon Diversity") +
scale_color_discrete(name="Time Point") +
stat_ellipse() + #Adds confidence interval ellipses
ggtitle("Jackson C57BL/6 Mice i.p injected with 100 ug of curli twice a week")

I've managed to have PC1 and PC2 in my graph, which is great, but I was wondering if it were possible to add a third axis like in emperor in Qiime2R?

On a separate note, I was wondering if it were possible to figure out what exactly is causing the difference in my data points. I understand that PC1 and PC2 percentile is describing the proportion of difference in the data set, but is it possible to understand which featureID/taxa is causing this?

Please let me know!

Mehrbod_Estaki · February 10, 2022, 6:45pm

Hi @TukelSB,

Do you mean you want an interactive 3D plot in R, or just replacing one of the PC1/2 with PC3?

Yes and no. You can create a biplot which overlays the normal PCA with "feature loadings" using emperor biplot, which will give you feature loadings that tries to explain how much the features (ex ASVs) are contributing to building that PC. But is this an exact/perfect explanation of your data? Debatable. I recommend a good start to read on this here.

TukelSB · February 10, 2022, 7:37pm

Hello,

It doesn't have to be interactive, but I did want a 3D plot in R like in the figures from the link you had provided.

Yes, I have looked at the biplot and ASVs before, but ultimately, my committee wasn't satisfied with just numbers for my answer. Would that mean doing different analysis (Heatmap or taxa-bar) to look at species that are most likely contributing to the PC? I was told that it was possible to figure out what was specifically contributing to PC for RNA-seq, and so it should be possible for microbiome data. Otherwise, I will look for other ways to present my data!

Mehrbod_Estaki · February 12, 2022, 9:11am

Hi @TukelSB ,

Can you clarify what you mean by:

The biplot would look something like this:

This biplot is taken directly from the q2-DEICODE tutorial here and uses Emperor. The arrows are the taxa loadings and are not just numbers, they are visualized.

As for adding 3D plots, I would advise caution against this because a) more often than not the 3rd PC does not really add much information to the plot, though of course there are exceptions, b) 3d plots in publications can get rather messy and of course you can't really see in the 3rd dimensions anyways.
I'll also add that you can simply export the 3D plot from emperor as an .svg and modify it anyway you want using your favorite program (ex illustration, inkscape).

That being said, if you wanted to make a 3D version of this for exploring there are lots of options in R, one of them is with the plot_ly .

Here I'm just using the Moving Pictures tutorial data.

library(tidyverse)
library(qiime2R)
library(plotly)

#import data
metadata<-read_q2metadata("sample-metadata.tsv")
uwunifrac<-read_qza("unweighted_unifrac_pcoa_results.qza")
shannon_div<-read_qza("shannon_vector.qza")$data %>% 
  rownames_to_column("SampleID") 


#create joint dataframe
dat <- uwunifrac$data$Vectors %>%
  select(SampleID, PC1, PC2, PC3) %>%
  left_join(metadata) %>%
  left_join(shannon_div)

# 3D Plot with plot_ly
plot_ly(dat, x= ~PC1, y= ~PC2, z= ~PC3, 
        type="scatter3d", mode="markers", color= ~`body-site`, size= ~`shannon`,
        marker = list(sizemode = 'diameter'),
        text = ~paste('Body site:', `body-site`, 
                      '<br>Shannon Diversity:', `shannon`,
                      '<br>Antibiotic use:', `reported-antibiotic-usage`)
        ))

Which gives something like this:

test3d.html (4.8 MB)

If you want to add the biplot vectors that'll take some more customization but at that point it's honestly just easier to export and modify the Emperor plot.

TukelSB · February 17, 2022, 8:25pm

I was asked to specifically identify the cause of PC1 and PC2. I guess in cases of RNAseq, it is possible to find the gene responsible for the differences seen on the PCoA, but I am unable to identify the Taxon/feature ID of PC1 and PC2.

Can you clarify what you mean by:

Thank you for showing me how to plot the data in 3D! It is exactly what I needed.

Mehrbod_Estaki · February 18, 2022, 9:05pm

Hi @TukelSB,
I think I already answered your question above, so let me rephrase for clarity:

PCA/PCoA axis do not represent just a single feature, whether that is a gene, taxa, or something else. So when you say it is possible to find the gene responsible for the differences on a PCoA, that is not actually correct. All of your features contribute in some way to those PCs. You can however, as I suggested above, try to find the most important features by looking at their loading scores. This is what the vectors represent in the biplot. So in the example biplot I showed above, the yellow, purple, and light blue vectors (taxa) are strongly associated with the tongue microbiome, while the orange vector is strongly associated with the "palm" samples. You can choose how many of these vectors you want to display, and you can display all of them if you want (though not recommended).

As for your second question, I was actually the one asking you to clarify what you meant when you said your committee wasn't satisfied with just numbers. Given that I showed you a way to identify the actual taxa associated with those axis, I was confused what you meant by "just numbers".

I suggest you look into the concept of PCA/PCoA if you are going to be using them, I recommended this page as a good starting point.