mmvec model convergence and Q2 issues

ceraker · August 19, 2024, 8:48pm

Hello,

First of all, thank you for your support in the past: I finally have mmvec installed properly and working with my own data.

I'm just a bit concerned because my convergence summaries look very different from the example data, as well as from other people's data. I've adjusted various parameters, and I'm just not sure what's going on! I have almost 200 samples, so I don't think that's the issue. My Q2 value is also consistently negative. I realize this might just mean that there isn't a significant relationship in my data, but given my preliminary statistical analysis, I don't think that's the case.

The code I used to run the model that gave me the "best" results:

qiime mmvec paired-omics \
        --i-microbes microbiome_done.qza \
        --i-metabolites metabolites.qza \
        --p-summary-interval 1 \
        --p-input-prior 0.1 \
        --p-output-prior 0.1 \
        --p-latent-dim 15 \
        --p-min-feature-count 20 \
        --output-dir model_summary5

And results, with a Pseudo Q-squared of -0.030503:

What am I missing here, or is my data just...bad?

Thank you so much, this forum has been incredibly helpful.
meta_samp_met.txt (10.8 KB)
https://app.box.com/s/iupsgehpo02howpf7glc07ktg72j5h9p

mortonjt · August 23, 2024, 1:24pm

Hi, it is hard for me to say. The one thing that stands out is that latent-dim=15 is really high. I typically don't go much beyond latent-dim=10, particularly for 200 samples.

I'm also wondering if the priors are hurting you here. If you have 200 samples (and your dimensionality isn't crazy high), then you can be more lenient with those priors and set them higher (e.g. 1). The higher the prior, the less constrained the model is.

Of course, this is highly dependent on how many features you have in your microbe / metabolite tables. If you have more than 10K features, it can be tricky to get a good fit with 200 samples.
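
As a rough sketch of what those two changes might look like, here is your original command with a lower latent dimension and looser priors. The values 3 and 1 are only illustrative starting points, not tuned recommendations, and the output directory name is made up for the example. Everything else is copied from your command.

```shell
# Illustrative values only (assumptions, not tuned for this dataset).
latent_dim=3   # well below the original 15
prior=1        # looser than the original 0.1

# Run only if QIIME 2 is available in the current environment.
if command -v qiime >/dev/null 2>&1; then
    # model_summary_lowdim is a hypothetical output directory name.
    qiime mmvec paired-omics \
        --i-microbes microbiome_done.qza \
        --i-metabolites metabolites.qza \
        --p-summary-interval 1 \
        --p-input-prior "$prior" \
        --p-output-prior "$prior" \
        --p-latent-dim "$latent_dim" \
        --p-min-feature-count 20 \
        --output-dir model_summary_lowdim
else
    echo "qiime not found on PATH; activate your QIIME 2 environment first" >&2
fi
```

Changing one parameter at a time and comparing the convergence summaries should make it easier to see which change actually helps.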

ceraker · August 29, 2024, 12:07am

Thank you so much, this was actually really clarifying. Lowering the latent-dim and increasing the priors helped, but there was still a lot of overfitting.

I actually ended up dividing my data up by treatment group, and have been having better luck that way. However, some of the paired results have shown an...odd representation of the null model, and I'm not sure what to make of it.

The model code was:

qiime mmvec paired-omics \
        --i-microbes microbiome_done_CX.qza \
        --i-metabolites metabolites.qza \
        --p-training-column train \
        --p-learning-rate 1e-3 \
        --p-summary-interval 1 \
        --p-input-prior 0.1 \
        --p-output-prior 0.1 \
        --p-epochs 200 \
        --p-batch-size 5 \
        --p-latent-dim 2 \
        --p-min-feature-count 5 \
        --output-dir model_summary_CX_3

And the paired summary looks like this, with a pseudo Q-squared of 0.750141:

Thank you again for all of your help (as well as your reply to my earlier post: it was indeed an issue with sample naming!).