Comparing Only Categories of Interest Within Songbird Formula

Guneet_Janda · February 9, 2021, 5:33am

Hi all,

I'm running songbird for differential abundance analysis, and the variable of interest is a categorical variable with the following labels: healthy controls, untreated, 1 month treatment, and 6 months treatment. I've been trying to build a songbird model to see which microbes change between groups, but so far my models perform relatively poorly: I can't get the loss to decrease or the R-squared to increase.

I was wondering if I could specify in the model formula to compare only the healthy controls and untreated samples, to see if the model can at least discriminate between those. That is the major comparison I am interested in, but I don't know how to specify in the formula to just ignore the treatment groups. Judging from our other data, I'm worried that the treated groups look too similar to the healthy controls, which is leading to the poor model performance. Any help in designing a formula or other comparison ideas would be appreciated!

Thanks!

mortonjt · February 10, 2021, 3:59pm

Hi @Guneet_Janda, just to confirm, did you run any beta diversity analysis? Did you see any differences at a global scale? If beta diversity fails to find a signal, then songbird will likely fail as well.
Otherwise, you may need to decrease the learning rate (e.g. 1e-4).

Regarding the second question, what have you treated? If you just want to look at one categorical variable, you can specify that column in the formula. If you are looking to exclude some samples, see qiime feature-table filter-samples.
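
As a rough illustration of what excluding samples means here, the sketch below keeps only the samples from two groups before a model is fit. It uses plain Python dictionaries rather than QIIME 2 artifacts, and the sample IDs, group labels, and counts are all made up for the example. In QIIME 2 itself, the same step would be done with qiime feature-table filter-samples and a metadata-based --p-where clause.

```python
# Hypothetical metadata: sample ID -> value of the 'Diagnosis' column.
# These labels are placeholders, not the actual values used in the study.
diagnosis = {
    "S1": "healthy control",
    "S2": "untreated",
    "S3": "1 month treatment",
    "S4": "6 months treatment",
    "S5": "untreated",
}

# Hypothetical feature table: sample ID -> {feature ID: count}.
table = {
    "S1": {"taxonA": 120, "taxonB": 30},
    "S2": {"taxonA": 15, "taxonB": 200},
    "S3": {"taxonA": 60, "taxonB": 90},
    "S4": {"taxonA": 100, "taxonB": 40},
    "S5": {"taxonA": 10, "taxonB": 180},
}


def filter_samples(table, metadata, keep_labels):
    """Return a new table with only the samples whose label is in keep_labels."""
    keep = set(keep_labels)
    return {
        sample_id: counts
        for sample_id, counts in table.items()
        if metadata.get(sample_id) in keep
    }


# Keep only the two groups the model should be trained and tested on.
filtered = filter_samples(table, diagnosis, ["healthy control", "untreated"])
print(sorted(filtered))
```

The treated samples are only dropped from the table passed to the model; the original table still holds them for later comparisons.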

Guneet_Janda · February 10, 2021, 4:56pm

Hi, thanks for the quick response! I did run beta diversity and saw a significant difference between the control group and the untreated group.

Regarding treatment, I should've explained better initially, but I basically have two cohorts: a healthy control group and an untreated patient group. The patient cohort was given treatment, and I have samples from various timepoints as treatment continued. What I want to do is look at the differentials between just the control and the untreated group, and then see how those ratios change with increased treatment time. So I would like to keep the samples that are outside of the control and untreated groups, but I don't want the songbird model to actually try to discriminate differences there.

Right now the model is unable to perform well. Since the beta and even alpha diversity metrics are significantly different between the untreated and control samples, but not very different between the control and treated samples, my guess is that it's having trouble discriminating between the control and treated samples, which leads to the poor performance. I want a model that performs well on just a subset of the samples (control and untreated), and then I want to see how those differentials change with time (in the treated samples).

Let me know if that makes sense, and thank you in advance for your help!

mortonjt · February 10, 2021, 5:15pm

Yes you can exclude samples, but you can also modify the formula.

For the next post, please include the commands you used and the diagnostic plots; that will allow us to give more helpful feedback.

Guneet_Janda · February 11, 2021, 3:16am

Ahh I see - I think I misunderstood the 'differential.qza' output. I thought it gave differentials for each sample, but it seems it gives the differentials for each feature between each comparison group. I see why excluding samples makes sense now, thank you!
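
Because the differentials are per feature rather than per sample, they can be used to define a log-ratio that is then computed in every sample, including treated samples that the model never saw. The sketch below illustrates that idea in plain Python. The differentials, feature names, sample names, and counts are invented, and picking the single top-ranked and bottom-ranked feature (k = 1) is an arbitrary choice for the example.

```python
import math

# Hypothetical per-feature differentials from a control-vs-untreated model.
differentials = {"taxonA": 2.1, "taxonB": 0.4, "taxonC": -0.3, "taxonD": -1.8}

# Hypothetical counts for all samples, including treated samples that
# were excluded when the model was fit.
counts = {
    "control_1": {"taxonA": 10, "taxonB": 50, "taxonC": 40, "taxonD": 80},
    "untreated_1": {"taxonA": 90, "taxonB": 45, "taxonC": 35, "taxonD": 10},
    "month1_1": {"taxonA": 40, "taxonB": 50, "taxonC": 40, "taxonD": 40},
    "month6_1": {"taxonA": 15, "taxonB": 55, "taxonC": 38, "taxonD": 60},
}


def log_ratio(sample_counts, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts."""
    num = sum(sample_counts[f] for f in numerator)
    den = sum(sample_counts[f] for f in denominator)
    if num == 0 or den == 0:
        return None  # the log-ratio is undefined when either side is zero
    return math.log(num / den)


# Rank features from most negative to most positive differential.
ranked = sorted(differentials, key=differentials.get)
k = 1  # number of features taken from each end of the ranking
denominator = ranked[:k]
numerator = ranked[-k:]

# Compute the same log-ratio in every sample, treated ones included.
ratios = {
    sample_id: log_ratio(sample_counts, numerator, denominator)
    for sample_id, sample_counts in counts.items()
}
for sample_id, value in ratios.items():
    print(sample_id, value)
```

Which group a positive differential points toward depends on the reference level used in the formula, so the direction should be checked against the actual model output.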

I excluded the samples that I didn't want the model to train or test on, and the following is the code and output:

# songbird is used here through the QIIME 2 Artifact API
from qiime2.plugins import songbird

sb = songbird.methods.multinomial(
    l7_presortbl,
    metadata,
    'Diagnosis',
    training_column=None,
    num_random_test_examples=7,
    epochs=10000,
    batch_size=5,
    differential_prior=1e-10,
    learning_rate=0.0005,
    clipnorm=10.0,
    min_sample_count=1000,
    min_feature_count=10,
    summary_interval=1,
    random_seed=0,
    silent=False,
)

sb_null = songbird.methods.multinomial(
    l7_presortbl,
    metadata,
    '1',
    training_column=None,
    num_random_test_examples=7,
    epochs=10000,
    batch_size=5,
    differential_prior=1e-10,
    learning_rate=0.0005,
    clipnorm=10.0,
    min_sample_count=1000,
    min_feature_count=10,
    summary_interval=1,
    random_seed=0,
    silent=False,
)

songbird.visualizers.summarize_paired(sb.regression_stats, sb_null.regression_stats).visualization

The covariate model doesn't seem to perform any better than the null model, and the loss curve definitely looks a bit strange. The pseudo Q-squared is -0.004248, which seems rather low. As you can tell from the code, I played around with differential_prior and learning_rate. Any tips you can give to help with model fitting would be appreciated, and again thank you for your time!
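
For context on that number: as I understand the Songbird documentation, the pseudo Q-squared is one minus the ratio of the covariate model's average cross-validation error to the null model's, so a value at or below zero means the covariates are not improving predictions on the held-out samples. The sketch below only illustrates that relationship. The function name and the error values are made up, and this is not Songbird's own implementation.

```python
def pseudo_q_squared(model_cv_error, null_cv_error):
    """Illustrative pseudo Q-squared: 1 - (model CV error / null CV error)."""
    if null_cv_error <= 0:
        raise ValueError("null_cv_error must be positive")
    return 1.0 - model_cv_error / null_cv_error


# Made-up cross-validation errors, chosen only to show the sign behaviour.
better_than_null = pseudo_q_squared(95.0, 100.0)   # model error below null
worse_than_null = pseudo_q_squared(100.5, 100.0)   # model error above null

print(better_than_null > 0)  # a positive value means the model beats the null
print(worse_than_null < 0)   # a negative value means it does not
```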

mortonjt · March 1, 2021, 5:05pm

OK, your differential prior is a little too small: you are basically telling the model that you want all of the parameters to be the same. You'll want to set it to something like 1 to make it more reasonable. How many samples do you have in this study? If you have more than 10, you don't need a very small prior.
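
To give a feel for why 1e-10 is so restrictive, here is a small sketch. It assumes the differential prior acts as the width (standard deviation) of a zero-mean normal prior on each differential, which is how the Songbird documentation describes the parameter. The function name and the example differential of 1.0 are hypothetical, and this is not Songbird's code.

```python
def normal_prior_penalty(beta, prior_width):
    """Negative log-density of a zero-mean normal prior at beta,
    dropping the terms that do not depend on beta."""
    return beta ** 2 / (2.0 * prior_width ** 2)


beta = 1.0  # a hypothetical differential of modest size
for width in (1e-10, 1.0):
    penalty = normal_prior_penalty(beta, width)
    print(f"prior width {width:g}: penalty {penalty:.3g}")
```

With a width of 1e-10, the penalty for a differential of 1.0 is about twenty orders of magnitude larger than with a width of 1, so the fit is pushed toward differentials of essentially zero no matter what the data say.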