I was hoping you could help improve my understanding of this method and how to properly use it.
Correct me if I’m wrong but the main improvement here over using the aitchison distance alone is in the handling of zeros (matrix completion vs pseudocount). Do you have example cases of the improvement seen over aitchison with pseudocount? I saw the nice comparisons to jaccard and bray-curtis but not aitchison alone.
Second, I noticed the suggestion to not rarefy the data before running rpca. What is the effect of having one sample with 10X the depth of another sample? Does matrix completion work better when the zeros have an equal likelihood of being due to subsampling across samples?
Third, the other suggestion is to not collapse the data to the genus level. Could you expand on why this is the case? Many features cannot be annotated properly below the genus level so for plotting measures that use taxonomy, like the biplot, it is often helpful in my eyes to collapse to the genus level. Does this process skew the data in some form that interferes with rpca?
I have benchmarked this method to pseudocounts. However, the comparison is a bit of a strawman and did not end up in the final cut. A great resource on this topic is a comparison of zero handling methods by Justin Silverman et al. (https://www.biorxiv.org/content/10.1101/477794v1). There is an implementation with pseudocount based Aitchison in QIIME2 if you want to compare.
There should be no-effect but in some cases pre-filtering very shallow read depths may yield better results. The default in DEICODE is to filter samples with total reads less than 500 (see --p-min-sample-count) but again in many cases that may not be necessary. This all has to do with C-incoherence of the matrix, happy to point you to more literature if you are interested.
There are a few reasons, the main one being combining features in compositional space is not so simple. Visualizations of ranks and log-ratios based on the feature loadings are in the pipeline now for QIIME2. In the meantime, importing the ranks from the ordination file may be useful. This can be done either in python or R. Happy to send you more information in that direction if you are interested.
I have read the paper by Justin Silverman et al. I guess my main takeaway was that it depends on your data. Based on their paper it makes sense that most zero handling methods should be better than pseudocounts. I guess where my understanding is still lacking is how rpca handles datasets that are mainly sampling zeros vs biological zeros. Does it work well in both situations or only one? This is why I asked about whether having all samples at the same sequencing depth would help as it evens out the likelihood of zeros being due to sampling.
The key to matrix completion is the convex relaxation made by the low-rank assumption. The low-rank assumption also accounts for structural zeros, while intra-group sampling zeros are being imputed.
So the ratio between zero classes should not affect this method as long as there is some underlying low-rank structure to the data. The breakpoint here is if the underlying data is high-rank. This can happen in cases where there is a gradient between samples and features. Two examples of this are shown in the paper (Fig S1), the data for the 88 Soils (a perfect example of high-rank data) can be found here (https://qiita.ucsd.edu/study/description/103).
Hope that helps your understanding. Happy to send you more matrix completion literature if you are interested.
Sorry, one other question. When I use the qiime plugin I now I understand that the --p-rank parameter defines the output rank of the matrix which means the total variance of your axes in PCA will be defined by that number (rank # = number of PCA vectors to sum to 100% variance). When I analyze my own dataset I find that if I use the default rank setting of 3 that the axes explain 47%,27%,26% of the variance respectively. When I rerun this analysis with --p-rank set to 5 the axes explain 35%, 19%, 18%, 14%, 13% of the variance respectively. The clustering is generally similar but definitely easier to see in the --p-rank 3 plot since all of the variance is forced into the first 3 axes which we can visualize easily.
People often use the variance explained in the first three axes to determine how useful the ordination plot is, visually speaking. However, I feel like people would get different interpretations if they thought the data could be explained completely in the first three axes vs lower and lower amounts depending on the rank parameter chosen. Why should rank 3 be sufficient in most studies? Should this value be dependent on the number of metadata conditions you expect like in your paper? If you expected two outcomes but in reality there were four would forcing the rank to 2 cause overfitting?
Yes, the percent explained variance will spread out as you increase the rank. However, the input rank does not limit the ordination to only three clusters (could be less or more), similar to PCA. So, to answer your question, no that should not cause overfitting. With the caveat being very high-rank (gradient like) data such as the 88 Soils dataset. The only non-gradient case when you may need to increase the rank is if you have many samples over many environments such as the full EMP or American Gut datasets. These cases will be very evident in the ordination.