t-SNE or UMAP as q2 plugins

Stefan · April 14, 2020, 10:43am

Hi, does QIIME 2 already have plugins for ordination techniques that are alternatives to PCoA, such as t-SNE or UMAP? I might have a bachelor student interested in implementing such a plugin as part of their studies. Would that be welcome, or are there already other ongoing efforts?

Nicholas_Bokulich · April 14, 2020, 2:44pm

Hi @Stefan,
Thanks for reaching out.
t-SNE: this has been on my mind for a while. I've been meaning to wrap it in q2-sample-classifier but have not gotten around to starting on it. You're welcome to grab that issue or make a new plugin for this!
UMAP: a quick search shows me that @gwarmstrong may be working on a plugin for this. @gwarmstrong, is that still in development? Let us know if @Stefan can get involved!

Stefan · April 14, 2020, 3:14pm

Hi @Nicholas_Bokulich, thanks for the prompt reply. I very much like the interactive exploration via Emperor, so I thought of having something that replaces pcoa (Principal Coordinate Analysis, QIIME 2 2020.2.0 documentation) with either t-SNE or UMAP (and maybe others as well). What would be the best place to add this functionality? Which hyperparameters should we expose? I figure it would be best to wrap TSNE (scikit-learn 1.6.0 documentation).
@gwarmstrong any help is very welcome. Let me know if you already made some design decisions for UMAP, we might just want to copy and paste to ensure a consistent API.

Nicholas_Bokulich · April 14, 2020, 3:46pm

Indeed! I like your plan, and having these methods output an ordination result of some kind would allow you to use this as input for emperor or other methods — note that q2-emperor takes a PCoAResults artifact as input, so let's get @yoshiki and @ebolyen in on this conversation: should we change emperor to accept a different sort of input, e.g., OrdinationResults? Or cheat and have t-SNE/UMAP output a PCoAResults artifact?

As I mentioned, you would be very welcome to put this in q2-sample-classifier following that open issue above, unless you want to create your own new plugin for this.

Sounds good, that's what I was planning to use in sample classifier. I think all of the options for sklearn.manifold.TSNE are worth exposing, but set useful defaults so that users don't need to fiddle too much to get something usable.

I'd recommend accepting a distance matrix as input... then any distance metric can be used, including metrics like unifrac that aren't available in sklearn. Actually, accepting a PCoAResults artifact as input could also be useful (per the note on that page that "It is highly recommended to use another dimensionality reduction method..."). So many possibilities!
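As a rough illustration of the distance-matrix route, here is a minimal sketch that feeds a precomputed matrix to sklearn.manifold.TSNE. The random count table, the hand-rolled `bray_curtis_matrix` helper, and the small perplexity are assumptions made only to keep the example self-contained; this is not how any QIIME 2 plugin implements it. One detail worth knowing: scikit-learn rejects init="pca" when metric="precomputed", so the sketch sets init="random".

```python
import numpy as np
from sklearn.manifold import TSNE


def bray_curtis_matrix(table):
    """Pairwise Bray-Curtis dissimilarities between the rows of `table`."""
    diff = np.abs(table[:, None, :] - table[None, :, :]).sum(axis=2)
    total = (table[:, None, :] + table[None, :, :]).sum(axis=2)
    return diff / total


# Toy data (an assumption): 20 samples x 10 features of random counts.
rng = np.random.default_rng(0)
table = rng.integers(1, 50, size=(20, 10)).astype(float)
dm = bray_curtis_matrix(table)

# metric="precomputed" tells TSNE that `dm` already holds distances.
tsne = TSNE(
    n_components=2,
    metric="precomputed",
    init="random",      # init="pca" is not allowed with precomputed distances
    perplexity=5,       # must be smaller than the number of samples
    random_state=0,     # fixed seed for reproducibility
)
embedding = tsne.fit_transform(dm)
print(embedding.shape)
```

Because the distances are computed outside scikit-learn, any metric (including phylogenetic ones such as UniFrac) could be swapped in for the toy Bray-Curtis helper.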

yoshiki · April 14, 2020, 4:19pm

I like the idea of displaying t-SNE results using Emperor. The OrdinationResults object is rather flexible, and can probably do the job. However, I think it would also be fine to use a different format if that made more sense. In terms of the type, I think having a DimensionalityReduction parent type might make sense. Worst case, we can always have a qiime emperor plot-tsne visualization and handle a new type directly.

I am happy to help with testing, and debugging any visual artifacts that might come up on the plotting side of things.

Stefan · April 14, 2020, 4:47pm

Regarding the type: I think the sklearn term is "embedding" as a general result from any dimensionality reduction method. I don't want to break the current q2-Emperor input, but to me it looks like we could make q2-Emperor accept either an embedding (sklearn parlance) or an OrdinationResult (skbio parlance). Technically, the current format for Emperor should directly support t-SNE, MDS or others. I would welcome @Nicholas_Bokulich making a decision here, as you have the best overview of all the data types in q2.

@yoshiki thanks for your help! From what I saw, t-SNE and UMAP are typically used to produce 2D plots. I tried it with Emperor and it worked; however, the default radius of the spheres is too big. Is there a way to default to a smaller one, maybe via the input file?

Nicholas_Bokulich · April 14, 2020, 5:16pm

@ebolyen and I chatted out-of-loop and we think that you should just output a PCoAResults artifact for now... we can always update the method and q2-emperor later on to output/input a tSNEOrdination or some other more specific type if necessary.

gwarmstrong · April 14, 2020, 6:06pm

I have not actively been developing the plugin since the initial prototype a few months back. I would be happy to provide input on what I have done!

I think the author's implementation and documentation of UMAP is a good place to start. IIRC, there are upwards of 20 parameters to umap.UMAP, but you probably only need the basic parameters to start: n_neighbors, min_dist, and n_components. I would also recommend using random_state for reproducibility.

I am not sure that a consistent API with what I wrote is really necessary, AFAIK no one is using the plugin-draft I wrote.

Totally agree with this! In the plugin I wrote, I ended up exposing two avenues for interacting with umap.UMAP: one that uses a feature table and one that uses a distance matrix. IIRC you could actually just have one interface that accepts something like (FeatureTable, Choices([<list of metrics>])) or (DistanceMatrix, Choices(['precomputed'])) with TypeMap! Let me know if you want more guidance here.

Typically, in the publications I have seen, these methods are used to make 2D plots. You can also use them to make 3D plots, and I was able to make some nice 3D UMAP visualizations. HOWEVER, if you make 3D plots with t-SNE or UMAP, you cannot just take the first 2 components to make a 2D plot, as you can for PCoA. My understanding is that the objective functions for these methods do not enforce anything special about a particular axis (unlike PCoA, which orders axes by eigenvalue, so the leading axes are the same regardless of how many components are kept).
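The PCoA half of that claim can be checked with a small sketch. The `pcoa` function below is a bare-bones classical PCoA written with NumPy for illustration only (it is not the skbio or q2-diversity implementation), and the random points are toy data. It shows that the first two axes of a 3-component PCoA match a 2-component PCoA of the same distance matrix.

```python
import numpy as np


def pcoa(dm, n_components):
    """Classical PCoA: coordinates for the n_components largest eigenvalues."""
    n = dm.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to get a Gram matrix.
    gram = -0.5 * centering @ (dm ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Sort axes by decreasing eigenvalue, then keep the leading ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy data (an assumption): Euclidean distances between 15 random points.
rng = np.random.default_rng(0)
points = rng.normal(size=(15, 5))
dm = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

coords_3d = pcoa(dm, 3)
coords_2d = pcoa(dm, 2)

# The leading axes do not depend on how many components were requested.
same_axes = np.allclose(coords_3d[:, :2], coords_2d)
print(same_axes)
```

t-SNE and UMAP give no such guarantee, since a 2D and a 3D embedding are separate optimizations; that is why a 2D plot should come from a run with n_components=2 rather than from slicing a 3D result.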

To change the sphere size via the interface:

Go to the Scale tab in your emperor plot.
Choose a metadata variable (doesn't matter what). Do not check "Change scale by value".
Adjust the 'global scaling' slider.

I am not sure if there is a way to set the default while generating the plot.

gwarmstrong · April 14, 2020, 6:10pm

Outputting a PCoAResults artifact is also what biocore/deicode does, even though its result is really an SVD and not a PCoA. So the precedent exists.

yoshiki · April 14, 2020, 8:49pm

Yes, defaults have been an ongoing work in progress. Happy to figure something out once you have some examples.

If anyone is interested, I would very much be game to try running this in the browser with one of the JS implementations out there.

twollhoewer · October 31, 2020, 12:37pm

Hi, are there any updates on the implementation of UMAP and t-SNE? My master's thesis is about dimensionality reduction and I want to add both of them to the q2_diversity plugin. Are there any other approaches in progress at the moment?

thermokarst · November 2, 2020, 2:15pm

Hey there @twollhoewer! Ccing @gwarmstrong.

Stefan · January 13, 2022, 8:51am

For the record:
We have successfully integrated t-SNE and UMAP computation into the core q2-diversity plug-in. Happy dimensionality reduction!
https://docs.qiime2.org/2021.11/plugins/available/diversity/tsne/
https://docs.qiime2.org/2021.11/plugins/available/diversity/umap/

gibsramen · January 13, 2022, 4:52pm

@gwarmstrong also published a great paper reviewing UMAP for microbiome data as well as some recommendations.

https://journals.asm.org/doi/10.1128/mSystems.00691-21