This is my first time trying MMvec, and I only have 12 samples for the integration.
I did not specify a --training-column; I ran the code as it is on GitHub.
Are my results still reliable?
Thank you very much,
Looking forward to your response,
Yeah, unfortunately it's hard to do much with 12 samples. I'd definitely specify the training column.
It also looks like your model didn't reach convergence, so I'd run it longer.
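For reference, here's roughly what that looks like with the q2 plugin (a sketch; the artifact names are placeholders for your own files, and the metadata column, here called Testing, should label each sample Train or Test):

```
qiime mmvec paired-omics \
  --i-microbes microbes.qza \
  --i-metabolites metabolites.qza \
  --m-metadata-file metadata.txt \
  --p-training-column Testing \
  --o-conditionals conditionals.qza \
  --o-conditional-biplot biplot.qza
```

The samples marked Test get held out for the cross-validation curve, so allocate enough of them for that curve to be meaningful.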
How many of my samples would you recommend using to train the model? Also, could you please tell me how to run it for longer?
What does it mean that the model didn't reach convergence?
I am sorry to bother you, but I would really appreciate some guidance on how to run the model for longer.
I have tried different settings for running the code, but I am not sure I am running it correctly, since it is still not reaching convergence. This is the code that I ran:
Hi @DanisaBescucci, not a problem -- the keyword is --p-epochs. It specifies the number of iterations the model takes to run. Right now, the number of epochs is set to 10 by default. Judging from your model, it looks like you haven't put a dent in training, so you may want to bump this up to 50 or 100.
Increasing the --p-learning-rate may also help (it's currently at 1e-3; you could try 1e-1).
There is an extended discussion of both of these parameters in the FAQs.
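For example, adding something like this to the command you already have (double-check the exact spellings against qiime mmvec paired-omics --help):

```
  --p-epochs 100 \
  --p-learning-rate 1e-1 \
```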
Your plot actually looks great -- your loss has flattened, which is what convergence should look like.
But your cross-validation plot on the top is still empty -- I'm not sure why that is. Maybe you don't have enough samples? How many testing samples did you allocate? (If you are comfortable sharing your data, that may help.)
Dealing with 12 samples is generally tricky -- we have not been able to get MMvec to work in that setting. It may be worthwhile to look for other datasets and see whether you can merge yours with one of them. If not, you can't fit 2 latent dimensions with 12 samples -- you'll be lucky if you can fit 1 latent dimension.
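(If you do go down to one dimension, that's just

```
  --p-latent-dim 1 \
```

added to the same command.)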
I will try with 1 latent dimension. This dataset contains two different treatment groups, and I could add 6 more samples from a control group. Would the model work the same way if I have three different treatment groups?
Sorry, I forgot to share that one. Here is my metadata file. model-summary.qzv (36.1 KB)
I will add the extra samples that I have!
Thank you very much,
I have tried the model after adding 12 more samples, for a total of 24, and I still can't get the first graph to show anything.
I have attached my metadata file, the OTU table, the metabolite table, and the qzv file with the summary.
The code that I ran is:
How long does it take to run? Does this run complete within 1 minute? It's possible that the run is too fast (i.e., it completes within 1 second and doesn't record anything). You could increase --p-epochs to 1000 to double-check this.
I think it'll be very important to look at the cross-validation loss to make sure that it also converges, particularly given how small the sample size is.
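As a sketch, assuming a recent q2-mmvec (verify both flags against --help):

```
  --p-epochs 1000 \
  --p-summary-interval 1 \
```

--p-summary-interval is the number of seconds between recorded points, so dropping it to 1 should force points to be written even if the run finishes very quickly.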
It always takes less than a minute to run, even with --p-epochs 1000. However, I have run it with that setting and now the cross-validation curve is there. I also changed --p-latent-dim to 0.
Does this look good to you? How would you interpret the curve going up at the end?
Sorry for all the questions,
Thank you very much for taking the time to troubleshoot this with me!
See the jump up at the very end -- that's a sign of overfitting. Your cross-validation error should be strictly decreasing. So this summary is basically telling you that you still have too few samples.
Also, if you set --p-latent-dim 0, that basically means you are only going to compute intercepts -- this is really only designed for hypothesis testing with the Q² score (also highlighted in the README).
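For the record, that baseline run is just your same command with zero latent dimensions, keeping its stats as the null model (a sketch, assuming a release that exposes the model-stats output; file names are placeholders):

```
qiime mmvec paired-omics \
  --i-microbes microbes.qza \
  --i-metabolites metabolites.qza \
  --m-metadata-file metadata.txt \
  --p-training-column Testing \
  --p-latent-dim 0 \
  --o-conditionals null-conditionals.qza \
  --o-conditional-biplot null-biplot.qza \
  --o-model-stats null-stats.qza
```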
From what I'm seeing, there are too few samples in your dataset to draw meaningful conclusions with MMvec. Maybe with much bigger guns we'll be able to answer these small-sample-size questions, but not at this exact moment.
I am still trying to run the model, since I am doing this for an independent study. I have now tried a different data set with 42 samples.
When I run the model at 50 epochs I get the following graph (canola_stats_50). However, if I increase the epoch number to 500, I start seeing overfitting (canola_stats_500).
So I was wondering: how should I interpret the first results? Could I use the conditional probabilities obtained from running the model at 50 epochs?
It looks like the cross-validation metric isn't recorded at the first few time steps.
Given how low the cross-validation metric is (and how little variability there is), this study is probably fine. But I would recommend following up with the Q² score, so that you have a hard number showing that your model is statistically significant.
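If it helps, the Q² comparison is done by summarizing your fitted model's stats against the null model's (again a sketch; double-check the action and parameter names against qiime mmvec --help, and note that model-stats.qza and null-stats.qza here stand in for your own outputs):

```
qiime mmvec summarize-paired \
  --i-model-stats model-stats.qza \
  --i-baseline-stats null-stats.qza \
  --o-visualization paired-summary.qzv
```

Roughly speaking, a Q² near 1 means the model explains much more held-out variation than the intercept-only baseline, while a Q² at or below 0 means it isn't beating the baseline at all.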