Songbird model building and diagnostics


I've had a few questions about Songbird models, and @mortonjt suggested that these questions might benefit other members of the community, so here are the questions that have already been answered:

Question: The Q2 value between the null and full model is close to zero (0.001); however, beta diversity analysis shows significant differences between the groups of interest. How should such Songbird results be interpreted?

Answer: "If you have significant beta diversity results, I would weigh those results more heavily than the q2 score."

Question: Is an unbalanced design between groups a problem for Songbird models?

Answer: "Yes, an unbalanced design does have the risk of biasing your results. But hopefully the cross validation should provide some insights on how bad this bias is."

Question: Should increasing the number of testing samples be accompanied by increasing the --p-batch-size?

Answer: "--p-batch-size is mainly used for training. It is ok to stick with the defaults."

Question: If you use the "Testing" column in full models, should you also include it in the null model?

Answer: "Right, ideally you would use the same train/test splits for both models (your model and the null model)"

Question: Is a taxon that is common in most samples, and whose ASVs are ranked as both positively and negatively associated with the variable of interest, a good choice of reference frame?

Answer: "Certainly not a bad idea, since you'll be able to generate more log-ratios in Qurro. Although the interpretation can be a bit wonky without absolute abundances, at the very best you can sort the microbes by log-fold change, so positive/negative associations could be misleading if your assumptions are off."

Since the first Q&A, two more questions have come up:

  1. With the unbalanced design (79, 65, and 29 animals), I used the --p-training-column parameter to include an equal number of samples from each group (18 test samples in total), and the CV score was higher (85) than when the same number of testing samples was chosen by Songbird (75). Does this mean it's better to use --p-num-random-test-examples than --p-training-column? Or should the testing set instead reflect the unbalanced nature of the data, i.e., hold out a proportional number of samples per group (e.g., 8, 7, and 3) rather than an equal number? Is this the bias created by the unbalanced sample design mentioned before?

  2. Does the low Q2 value warrant further confirmation of the ranked differentials, for example in the way it was done by Taylor in this paper?


  1. If you want to generate reliable Q2 values, you do need to use --p-training-column, otherwise you could have inconsistent train/test splits. Regarding the unbalanced design, I’d first run with just the raw data. The problem of unbalanced data stems from how confidence intervals are performed in traditional statistical tests – I don’t think it is as big of a problem here.
  2. Yes, it is always better to get additional validation.
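To make the proportional-holdout idea from question 1 concrete, here is a minimal sketch of building a Train/Test column for --p-training-column so that each group contributes test samples in proportion to its size. The group sizes come from the thread; the sample IDs are placeholders, and the "Train"/"Test" labels follow the convention used in the Songbird tutorial (an assumption worth checking against your own metadata).

```python
import random

# Group sizes from the thread: 79, 65, and 29 animals.
groups = {"A": 79, "B": 65, "C": 29}
total_test = 18  # total held-out samples

random.seed(42)

# Allocate test samples proportionally (~8, 7, and 3) instead of equally.
total = sum(groups.values())
testing_column = {}
for name, n in groups.items():
    n_test = round(total_test * n / total)
    ids = [f"{name}{i}" for i in range(n)]       # placeholder sample IDs
    test_ids = set(random.sample(ids, n_test))   # random holdout per group
    for s in ids:
        # --p-training-column expects "Train"/"Test" labels per sample.
        testing_column[s] = "Test" if s in test_ids else "Train"
```

The `testing_column` mapping would then be written into the sample metadata file as a new column and passed to Songbird via --p-training-column.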

I have a few questions about your answer, to make sure I got it right:

What do you mean by "just the raw data" in reply one? To just stick with the split as is? I assigned the 18 test samples partially randomly using Excel, making sure there isn't an accidental sex bias (e.g., all test animals being male or female).

In the article I mentioned, the authors say they used permutation tests to make sure the log-ratios are not random. Is that sufficient when Q2 is around zero? Is there a way to verify that the differentials are non-random, or is this method of looking at log-ratios enough?

Thanks so much for your help!

  1. Yes that is fine.
  2. Erm ... I'm personally not a big fan of permutation tests (they often mask the underlying issue but don't actually resolve it). But if you are going to use it just for the Q2 score, then that should be ok.

Hi again,

I keep wondering about the additional verification of the Q2 score. I now have models with a Q2 of -0.01, but beta diversity shows slight but significant differences between groups. The article I mentioned before used permutations on randomly selected log-ratios, so I'm wondering:
a) whether permuting random log-ratios is a good option in my case
b) whether I could use something to verify that the ranking is non-random. What I had in mind is assigning a random rank (on the same scale as the differentials) to the differentially ranked ASVs and then running a correlation test to show that the ranking produced by Songbird is not random. Or would that not help or serve the purpose in this case at all?

As may have become clear, stats aren't my main strength, and this might be more of a stats question than a Songbird question; however, I'm positive many people have similar questions about how best to proceed in such cases.

Right, there are many ways to do the permutation test.

Shuffling the taxa actually won't make a difference here, since you are just reassigning names -- you'll get the exact same Q2 scores. You'll want to shuffle the samples.
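A minimal sketch of the sample-shuffling permutation test described above. The `group_mean_diff` statistic and the synthetic data are stand-ins: in practice you would refit Songbird on metadata with shuffled sample labels and collect the resulting Q2 scores as the null distribution, then compare the real model's Q2 against it.

```python
import random

# Stand-in statistic: in a real run, this would be replaced by refitting
# Songbird with the given labels and returning the model's Q2 score.
def group_mean_diff(values, labels):
    a = [v for v, g in zip(values, labels) if g == "A"]
    b = [v for v, g in zip(values, labels) if g == "B"]
    return abs(sum(a) / len(a) - sum(b) / len(b))

random.seed(0)
values = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2]  # synthetic data
labels = ["A"] * 4 + ["B"] * 4

observed = group_mean_diff(values, labels)

# Shuffle the *sample labels*, not the taxon names: reassigning taxon
# names leaves every sample's composition (and hence Q2) unchanged.
null_stats = []
for _ in range(999):
    shuffled = labels[:]
    random.shuffle(shuffled)
    null_stats.append(group_mean_diff(values, shuffled))

# One-sided empirical p-value with the standard +1 correction.
p = (1 + sum(s >= observed for s in null_stats)) / (1 + len(null_stats))
```

A small p here means the observed group separation is unlikely under random label assignment, which is the same logic you would apply to the Q2 score of the real model versus the shuffled-label refits.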


Okay! I thought that exporting the differentials from the full model for that specific variable and assigning random values to the differentially ranked ASVs would show that the ASV ranking for that variable is not random, but if I understand correctly, that doesn't say anything about the Q2.

So by shuffling the samples, do you mean changing the metadata passed into the model, e.g., so that a sample assigned to group A is now in group B or group C at random, then running that model and seeing how it ranks ASVs compared to the full model, and comparing their Q2 or differential values? Or something completely different?

Yes, that is correct :slight_smile:
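To make that shuffle concrete, here is a minimal sketch of permuting the group labels across samples in a metadata table before rerunning the model. The column name, sample IDs, and label values are placeholders; only the group column is permuted, so everything else about each sample stays fixed.

```python
import csv
import io
import random

# Placeholder metadata; in practice this would be read from your TSV file.
metadata_tsv = """sample-id\tgroup
S1\tA
S2\tA
S3\tB
S4\tC
"""

random.seed(1)
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))

# Permute the group labels across samples, breaking any real
# sample-group association while keeping the label frequencies intact.
labels = [r["group"] for r in rows]
random.shuffle(labels)
for r, g in zip(rows, labels):
    r["group"] = g

# `rows` would now be written back to a TSV and used as the metadata
# for a shuffled-label (null) Songbird run.
```

Comparing the Q2 and differentials of the real model against those of such shuffled-label runs is the sample-level permutation described above.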

Awesome! Thanks for all the help! :slightly_smiling_face: