I've had a few questions about Songbird models, and @mortonjt suggested that these questions might benefit other members of the community, so here are the questions that have already been answered:
Question: The Q2 value between the null and full models is close to zero (0.001); however, beta diversity analysis shows significant differences between the groups of interest. How should such Songbird results be interpreted?
Answer: "If you have significant beta diversity results, I would weigh those results more heavily than the q2 score."
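For context on why a Q2 near zero means the model barely beats the null: as I understand it, Songbird's Q2 is analogous to R², comparing the model's held-out prediction error against the null model's. The sketch below is a simplified illustration with made-up error values; the function name `pseudo_q2` and the numbers are mine, not Songbird's.

```python
# Simplified illustration (made-up values) of how a pseudo-Q2 relates
# model and null cross-validation errors on held-out samples.
def pseudo_q2(model_cv_error, null_cv_error):
    """~1 means the model explains most held-out variation;
    ~0 means it is barely better than the null model."""
    return 1.0 - (model_cv_error / null_cv_error)

# A model whose held-out error is almost identical to the null's
# yields a Q2 close to zero, as in the question above.
print(pseudo_q2(model_cv_error=99.9, null_cv_error=100.0))  # ~0.001
```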
Question: is an unbalanced design between groups a problem for Songbird models?
Answer: "Yes, an unbalanced design does have the risk of biasing your results. But hopefully the cross validation should provide some insights on how bad this bias is."
Question: Should increasing the number of testing samples be accompanied by increasing the --p-batch-size?
Answer: "--p-batch-size is mainly used for training. It is ok to stick with the defaults."
Question: If you use the "Testing" column in full models, should you also include it in the null model?
Answer: "Right, ideally you would use the same train/test splits for both models (your model and the null model)"
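One way to guarantee the same split for both models is to mark the held-out samples in a metadata column and pass that column to both runs. The pandas sketch below is a hypothetical example (the sample IDs, group labels, and the column name "Testing" are illustrative):

```python
import pandas as pd

# Hypothetical metadata: a "Testing" column marks held-out samples so
# the same train/test split can be reused for both the full model and
# the null model, keeping their CV scores comparable.
metadata = pd.DataFrame(
    {"group": ["A", "A", "B", "B", "C", "C"]},
    index=[f"sample{i}" for i in range(1, 7)],
)
metadata["Testing"] = "Train"
metadata.loc[["sample2", "sample5"], "Testing"] = "Test"
print(metadata["Testing"].value_counts().to_dict())  # {'Train': 4, 'Test': 2}
```

Both runs would then point --p-training-column at this same "Testing" column.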
Question: Is using a taxon that is common in most samples and has ASVs ranked both as positively and negatively associated with the variable of interest a good option for the reference frame?
Answer: "Certainly not a bad idea since you'll be able to generate more log-ratios in qurro. Although the interpretation can be a bit wonky without abs abundances, at the very best you can sort the microbes by log-fold change, so positive / negative associations could be misleading if your assumptions are off."
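To make the caveat about interpretation concrete: a log-ratio against a reference taxon sidesteps the unknown total microbial load, but it only tells you how the numerator changes relative to the reference, not in absolute terms. A toy NumPy example with made-up counts:

```python
import numpy as np

# Toy counts (made up): column 0 is the taxon of interest, column 1 is
# a reference taxon present at a stable level across samples.
counts = np.array([
    [10, 5],
    [20, 5],
    [40, 5],
])

# The log-ratio is invariant to total load: doubling every count in a
# sample would leave these values unchanged.
log_ratios = np.log2(counts[:, 0] / counts[:, 1])
print(log_ratios)  # [1. 2. 3.]
```

If the reference itself shifts with the variable of interest, the same log-ratios could reflect changes in the reference rather than the numerator, which is why positive/negative associations can be misleading without absolute abundances.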
Since the first Q&A, two more questions have arisen:
In the unbalanced design (79, 65, and 29 animals), I used the --p-training-column parameter to hold out an equal number of samples from each group (18 test samples in total), and the CV score was higher (85) than when Songbird chose the same number of testing samples itself (75). Does this mean it's better to use --p-num-random-test-examples than --p-training-column? Or should the samples assigned to the testing group better reflect the unbalanced nature of the data, i.e., instead of holding out an equal number of test samples per group, hold out a proportional number (e.g., 8, 7, and 3)? Is this the bias created by the unbalanced sample design mentioned before?
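For the proportional option, one standard way to allocate a fixed number of test samples across unequal groups is largest-remainder (stratified) allocation. This sketch is my own illustration, not something Songbird provides; the function name is hypothetical:

```python
import math

# Largest-remainder allocation: give each group the floor of its
# proportional share of test slots, then hand the remaining slots to
# the groups with the largest fractional parts.
def proportional_test_counts(group_sizes, total_test):
    total = sum(group_sizes.values())
    shares = {g: n * total_test / total for g, n in group_sizes.items()}
    counts = {g: math.floor(s) for g, s in shares.items()}
    leftover = total_test - sum(counts.values())
    by_remainder = sorted(shares, key=lambda g: shares[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts

# With groups of 79, 65, and 29 animals and 18 test samples total,
# this reproduces the 8/7/3 split mentioned above.
sizes = {"group1": 79, "group2": 65, "group3": 29}
print(proportional_test_counts(sizes, total_test=18))  # {'group1': 8, 'group2': 7, 'group3': 3}
```

These counts could then be encoded in a metadata column for --p-training-column, so the held-out set mirrors the unbalanced group sizes.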
Does the low Q2 value warrant further confirmation of the ranked differentials? For example, in the way it has been done by Taylor et al. in this paper?