Songbird with continuous variable measurement only on subset of samples

Mike_McFarlin · July 12, 2021, 12:47am

Hello,

I am analyzing cloacal samples from two bird species and I am interested in assessing whether features differ in abundance based on two continuous variables. I have a total of 73 cloacal samples, but only 16 of them have measurements of the continuous variables. These measurements come from tracking data on those 16 birds. I am curious how Songbird handles a variable that is only recorded for a subset of the sequenced samples, and whether this is a proper use of the tool.

After running the model below and comparing to the null I get a Q2 score of ~0.424 and it looks like my model is not overfit based on the figures below. Sidenote: I do not know why the null ran 1 million iterations and the model ran 200,000. Both were set to run 100,000.

qiime songbird multinomial \
  --i-table Body-site-feature-tables/Cloaca-table.qza \
  --m-metadata-file GullMetadata_forQIIME_updated_medianValues.tsv \
  --p-formula "LandPro+medFI+Species" \
  --p-epochs 100000 \
  --p-differential-prior 1 \
  --p-summary-interval .1 \
  --p-num-random-test-examples 5 \
  --p-random-seed 3 \
  --o-differentials Songbird/differentials-both-sp.qza \
  --o-regression-stats Songbird/regression-stats-both-sp.qza \
  --o-regression-biplot Songbird/regression-biplot-both-sp.qza

I then used qurro to examine differential features and built a plot with features found across one of my continuous variables. In the plot you can see below, only samples that had a measurement for the continuous variable were plotted, which makes sense.

However when I exported the data used to make the plot I noticed that a log ratio was also recorded for samples that did not have any continuous data measurements. How should I interpret this? Are my 16/73 samples with continuous data enough to identify differentially abundant features across all samples or am I reaching here? Is it safe to conclude that my samples without continuous data actually have the relationship between features indicated by the log ratio?

Thank you for your time and assistance and to the researchers for building these amazing tools!

-Mike

mortonjt · July 12, 2021, 5:14pm

I don't know what is going on with the number of iterations other than the possibility that you have different batch sizes -- if your null model batch size is 10x smaller than your other model, that would explain it.

Also you are using --p-num-random-test-examples which isn't recommended if you are doing null modeling -- you should be specifying the training column. See the README for more details.

I don't understand your question about the continuous data -- is this for the LandPro column? You are probably going to have problems there, since the null values are going to force the column to be recognized as a categorical variable. If you really want to model continuous values, you'll need to drop all of the samples with nulls, since Songbird can't model missing data.
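As a rough illustration of dropping the samples with nulls, here is a minimal Python sketch that keeps only metadata rows with a value in every continuous column. The sample IDs, species labels, and values are made up, and the set of strings treated as "missing" is an assumption you would adjust to match your own file. The same sample IDs would then need to be used to filter the feature table.

```python
import csv
import io

# Hypothetical metadata: only some samples have tracking-derived values.
METADATA_TSV = """sample-id\tSpecies\tLandPro\tmedFI
bird01\tgullA\t0.42\t1.8
bird02\tgullA\t\t
bird03\tgullB\t0.77\t2.4
bird04\tgullB\tNA\t3.1
"""

# Assumption: these strings (case-insensitive) mark a missing measurement.
MISSING = {"", "na", "nan", "null"}


def drop_missing(rows, columns):
    """Keep only rows that have a value in every one of the given columns."""
    return [
        row for row in rows
        if all(row[col].strip().lower() not in MISSING for col in columns)
    ]


rows = list(csv.DictReader(io.StringIO(METADATA_TSV), delimiter="\t"))
kept = drop_missing(rows, ["LandPro", "medFI"])
kept_ids = [row["sample-id"] for row in kept]
print(kept_ids)
```

Only the samples in `kept_ids` would go into the Songbird run, for both the real model and the null model.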

dwt · July 12, 2021, 5:25pm

Hey Mike,

73 / 16 (about 4.6) is suspiciously close to the 5x difference in iterations, so I would guess that your null model is maybe seeing the whole dataset, and your trained model only the subset with measurements.
If that is the case you should probably run both on a table filtered to just those 16 samples.
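For reference, the back-of-the-envelope comparison behind that guess looks like this in Python. It uses only the numbers quoted in this thread, and it assumes (without having checked Songbird's source) that the iteration count grows roughly in proportion to the amount of data the model sees.

```python
# Numbers quoted in this thread.
total_samples = 73
samples_with_measurements = 16
null_iterations = 1_000_000
model_iterations = 200_000

# If iterations scale with the data seen, these two ratios should be similar.
sample_ratio = total_samples / samples_with_measurements
iteration_ratio = null_iterations / model_iterations

print(f"sample ratio:    {sample_ratio:.2f}")
print(f"iteration ratio: {iteration_ratio:.2f}")
```

The two ratios are close but not identical, so this is a hint rather than proof; rerunning both models on the same filtered table is the real check.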

Also, to echo Jamie, you should specify your test-train split. Usually you want to use qurro to see your test performance, so you should run it on a table of your test samples, not the whole table.

Mike_McFarlin · July 12, 2021, 10:16pm

Thank you so much for your responses.

Apologies @mortonjt , I didn't explain that well. Yes, the "LandPro" variable is my continuous variable. As you suggested, I removed all the samples that did not have a measurement for this variable.

I added a column to specify the training and testing samples. Is using column specified training and testing samples only recommended when comparing to the null model? After comparing models, should I then re-run the model with random test samples? Sorry, the "if" in your statement has me a bit confused.

Hi @dwt, I believe you were correct about my null model running on the entire dataset and the trained model running on the subset with measurements. After subsetting, they both have the same number of iterations. Regarding qurro...

Do you suggest I modify the metadata input file for the qurro visualization command to only include the test samples? Something like this...

--m-sample-metadata-file Metadata_file_testing_samples.tsv

Thank you both again for the assistance!

mortonjt · July 12, 2021, 10:43pm

You should specify a training column if you want to generate reliable Q2 values.
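A minimal Python sketch of building such a column is below. It assumes the column should contain the labels "Train" and "Test", which is how I recall the Songbird README describing it (worth confirming there). The sample IDs and the column name `TrainTest` are hypothetical.

```python
import random


def assign_split(sample_ids, n_test, seed=0):
    """Label n_test randomly chosen samples 'Test' and the rest 'Train'."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test_ids = set(rng.sample(sorted(sample_ids), n_test))
    return {sid: ("Test" if sid in test_ids else "Train") for sid in sample_ids}


# Hypothetical IDs standing in for the 16 samples with measurements.
sample_ids = [f"bird{i:02d}" for i in range(1, 17)]
split = assign_split(sample_ids, n_test=5, seed=3)

# Print a two-column table that can be merged into the metadata TSV.
print("sample-id\tTrainTest")
for sid in sample_ids:
    print(f"{sid}\t{split[sid]}")
```

Because the split is stored in the metadata rather than drawn at run time, the real model and the null model are evaluated on exactly the same held-out samples.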

Regarding Qurro, that's a good question. If you want to be rigorous about overfitting, it isn't a bad idea to only look at your training data, and then see how this prediction holds for the test data.

Mike_McFarlin · July 14, 2021, 6:18pm

If the training and test data both showed the exact same features that would likely indicate overfitting, correct?

Thanks!

mortonjt · July 14, 2021, 6:32pm

erm, no ... ? The features present only depend on your filtering criteria -- if you have the same filters and the same microbes are present in both datasets, then this should be fine, right?

The train/test evaluation is more about seeing if the log-ratios are the same, or have roughly the same prediction.

Mike_McFarlin · July 14, 2021, 6:33pm

Ah ok, that makes sense. Thank you!

fedarko · July 15, 2021, 2:52am

Hi Mike -- It looks like your questions have been answered pretty thoroughly (thanks Jamie and Devin!), but just to address some points you brought up for reference:

This is just an artifact of how Qurro stores the data internally. It looks like it was able to compute log-ratios for almost all of your samples (70 / 73, judging by the screenshot you posted), including some samples without a LandPro measurement—but since the x-axis field in your sample plot is set to LandPro, these samples can't be shown. I'm pretty sure if you select another metadata field for the x-axis (e.g. Species) then these samples, or at least the subset of these samples that have a defined Species value, would appear in the plot.

...But that being said, if you're only interested in differential abundance with respect to samples with a defined LandPro measurement (and it sounds like this is the case), it probably makes sense to remove these samples before running Songbird as Jamie suggested.

It should be fine. These samples all have defined log-ratios, at least (i.e. they don't have a zero in the numerator or denominator), so there's no reason we can't compute this log-ratio for these samples.
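To make "defined log-ratio" concrete, here is a small Python sketch of the calculation as I understand it: the natural log of the summed numerator counts over the summed denominator counts, left undefined when either sum is zero. The feature names and counts are invented, and this is an illustration rather than Qurro's actual code.

```python
import math


def log_ratio(counts, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts.

    Returns None when either sum is zero, since the log-ratio is then undefined.
    """
    top = sum(counts.get(f, 0) for f in numerator)
    bottom = sum(counts.get(f, 0) for f in denominator)
    if top == 0 or bottom == 0:
        return None
    return math.log(top / bottom)


# Hypothetical feature counts for two samples.
sample_a = {"featA": 30, "featB": 10, "featC": 5}
sample_b = {"featA": 12, "featB": 0, "featC": 0}

ratio_a = log_ratio(sample_a, ["featA"], ["featB", "featC"])  # ln(30 / 15)
ratio_b = log_ratio(sample_b, ["featA"], ["featB", "featC"])  # denominator sums to zero
print(ratio_a, ratio_b)
```

Note that nothing in this calculation uses the sample metadata, which is why a sample with no LandPro value can still get a log-ratio.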

However, it's important to note that since this log-ratio is based on differential abundance results produced by Songbird for a different set of samples (i.e. the samples that had defined LandPro measurements), computing this log-ratio for the samples with undefined LandPro measurements might not be useful / relevant. (Or maybe it could be -- you could imagine, e.g., building a classifier that uses this log-ratio to try to assign a LandPro measurement to an unlabeled sample. But it doesn't sound like that's what you're going for here.)

edit: I should also mention that it looks like there are only 70 features in the dataset. This seems to me like a relatively low number of features -- it's not necessarily a bad thing, but it might be worth double checking that the upstream parts of the analysis are working as expected. If the reason for there being relatively few features is that this is 16S / metagenomics data represented as a collapsed table based on taxonomy or something like that, I think it would probably be best to run differential abundance using the uncollapsed table (but either approach is justifiable).

Mike_McFarlin · July 16, 2021, 8:49pm

Hi @fedarko ,

Thank you for your comments. This gives me a better idea of how Qurro is working with the Songbird data.

I was thinking about this exact use; I think it would be very interesting to use this as a classifier. Though given Jamie's comment that having null values would force the column to be recognized as a categorical variable, do you think this would make the current output of the Songbird model uninformative on the relationship of the continuous variable to these features? I ask because there would be a category for each of the LandPro values, plus another null category that would include samples that are likely a combination of all the other categories.

Thank you for this comment! I'll double check my input files.

fedarko · July 16, 2021, 9:24pm

No problem!

Yeah, the "forced categorical" handling of this variable because of null values will make Songbird's output not useful; since this is a continuous variable, it should be treated as one. (Unless it really does make sense to treat each unique LandPro value as a distinct category, but it doesn't sound like it.) As Jamie mentioned, the way around this will be filtering the samples with null LandPro values before running Songbird (this is assuming that the remaining samples won't be a problem for the other variables in your formula).

It should be feasible to run Songbird on the filtered set of samples that have non-null LandPro values, then run Qurro on the resulting differentials using the full set of samples (or even a completely different group of samples than the one you gave Songbird). This way, it's possible to look at the log-ratios of samples that weren't used to generate the differentials in Songbird -- e.g. if you want to see what the log-ratios look like for just the null LandPro samples, or if you're running Qurro using a "test" set of samples compared to what you gave Songbird.
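Conceptually, that two-step workflow looks something like the Python sketch below: pick numerator and denominator features from differentials fit on one set of samples, then compute log-ratios for samples that were never given to the model. All feature names, differential values, and counts are hypothetical, and the "top k / bottom k" selection is just one simple way to choose features.

```python
import math

# Hypothetical differentials for one covariate, as if fit on the filtered samples.
differentials = {"featA": 2.1, "featB": -1.7, "featC": 0.3, "featD": -0.9, "featE": 1.2}

# Hypothetical counts for samples that were NOT given to Songbird.
held_out = {
    "bird20": {"featA": 40, "featB": 10, "featD": 10, "featE": 0},
    "bird21": {"featA": 0, "featB": 7, "featD": 3, "featE": 0},
}


def pick_features(diffs, k):
    """Return the k highest-ranked and k lowest-ranked feature IDs."""
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    return ranked[:k], ranked[-k:]


numerator, denominator = pick_features(differentials, k=2)

log_ratios = {}
for sample_id, counts in held_out.items():
    top = sum(counts.get(f, 0) for f in numerator)
    bottom = sum(counts.get(f, 0) for f in denominator)
    # A zero on either side leaves the log-ratio undefined for that sample.
    log_ratios[sample_id] = math.log(top / bottom) if top and bottom else None

print(numerator, denominator)
print(log_ratios)
```

Samples whose log-ratio comes out undefined would simply be missing from the plot, which matches the 70 / 73 count mentioned earlier in the thread.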