Dear all, I am trying to run Gneiss following the available tutorial. It works just fine for a subset of my data: 18 samples containing 162 different features after some filtering (I removed features containing fewer than 5 sequences and those found in only 1 sample). After creating partitions with Gneiss I get 15 balances, y0-y14. It all looks great, and I want to do the same for my main dataset.
So, running the exact same commands on a larger dataset (72 samples, 903 features), I end up with 903 y’s. Now I am wondering if anyone can help me understand what is happening and what is not happening. I have also run the same samples without filtering features post-dada2, and I see the same problem.
Hi @stangedal. I’m having a little bit of trouble parsing your question. Could you post your code example with some screenshots of your results? Right now the clustering is a bit brute force: if you have D features, gneiss will calculate D-1 balances for a given tree.
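To make the D-1 count concrete, here is a minimal sketch in plain Python. It is not gneiss code: the tree is built by randomly merging clusters, the counts are made up, and the helper names (`random_bifurcating_tree`, `balance`) are hypothetical. It only illustrates that any bifurcating tree over D features has D-1 internal nodes, and that each internal node yields one balance (a scaled log-ratio of the geometric means of the features on its two sides).

```python
import math
import random


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def random_bifurcating_tree(leaves, rng):
    """Merge two random clusters at a time until a single root remains.

    Returns one (left_leaves, right_leaves) pair per internal node.
    This stands in for a real clustering; it is NOT how gneiss builds trees.
    """
    clusters = [[leaf] for leaf in leaves]
    internal_nodes = []
    while len(clusters) > 1:
        left = clusters.pop(rng.randrange(len(clusters)))
        right = clusters.pop(rng.randrange(len(clusters)))
        internal_nodes.append((left, right))
        clusters.append(left + right)
    return internal_nodes


def balance(sample, left, right):
    """Balance at one internal node: sqrt(r*s/(r+s)) * ln(g(left)/g(right))."""
    r, s = len(left), len(right)
    g_left = geometric_mean([sample[f] for f in left])
    g_right = geometric_mean([sample[f] for f in right])
    return math.sqrt(r * s / (r + s)) * math.log(g_left / g_right)


rng = random.Random(0)
features = [f"feature_{i}" for i in range(903)]  # D = 903, as in the larger dataset
nodes = random_bifurcating_tree(features, rng)

# Made-up strictly positive counts for a single sample (zeros would need a pseudocount).
sample = {f: rng.randint(1, 100) for f in features}
balances = [balance(sample, left, right) for left, right in nodes]

print(len(features), len(balances))  # D and D-1
```

So the number of balances grows with the number of features, regardless of how many samples there are.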
In my other attempt running these commands I used a dataset consisting of only 18 samples, 162 features and 766977 sequences. I have run the exact same commands, just using another qza file as input. The regression summary file gives the following:
What surprised me was that there are so many y's in the larger dataset, making the regression coefficients summary impossible to read, while my smaller dataset leaves me with y0-y14. I thought something had gone wrong in the process when I saw the 903 y's in my larger dataset... I guess my question is: does the regression coefficients summary in the first example look right to you?
I typically don’t rely on looking at the heatmap for sanity checking. First I’d recommend taking a look at the R^2 values and the MSE cross-validation values to sanity check how good the fit is, and if there is any overfitting.
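As a rough illustration of that kind of check, here is a small sketch in plain Python. It does not reproduce how gneiss computes its summary: it fits a single made-up balance against a single hypothetical covariate with ordinary least squares, then reports R^2 and per-fold training and held-out MSE. The data, the covariate, and the helper names (`fit_line`, `r_squared`, `kfold_mse`) are all assumptions for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def r_squared(xs, ys, slope, intercept):
    """Fraction of variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = mse(xs, ys, slope, intercept) * len(xs)
    return 1.0 - ss_res / ss_tot


def kfold_mse(xs, ys, k):
    """Return (train_mse, heldout_mse) for each of k folds."""
    indices = list(range(len(xs)))
    results = []
    for fold in range(k):
        held_out = indices[fold::k]
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        train_mse = mse([xs[i] for i in train], [ys[i] for i in train], slope, intercept)
        test_mse = mse([xs[i] for i in held_out], [ys[i] for i in held_out], slope, intercept)
        results.append((train_mse, test_mse))
    return results


rng = random.Random(0)
# Made-up data: 72 samples, one hypothetical covariate, one balance value per sample.
covariate = [rng.uniform(0.0, 10.0) for _ in range(72)]
balance_values = [0.5 * x + rng.gauss(0.0, 1.0) for x in covariate]

slope, intercept = fit_line(covariate, balance_values)
print("R^2:", r_squared(covariate, balance_values, slope, intercept))
for train_mse, test_mse in kfold_mse(covariate, balance_values, k=4):
    print("train MSE:", train_mse, "held-out MSE:", test_mse)
```

If the held-out MSE is consistently much larger than the training MSE, that is a hint the model is overfitting.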
Once you have established that you are getting reasonable fits, you can start looking at the heatmap in the regression summary to tease out which balances could be interesting.
The outline of these steps can be found here. Does that help guide how to start looking at these plots?