sample-classifier

Hello,

First, thanks to the developers of sample-classifier! I have used qiime sample-classifier regress-samples with a continuous variable.

I’m wondering whether an option could be made available to specify which samples go into the training and test sets. For example, if you run an experiment where you take repeated measurements on individuals, either through space or time, you would want to place some individuals in your training set and other individuals in your test set. An example of this implementation in R can be found here. Is something in the works to implement this in qiime2?

Warm regards,
Deanna

Hi @DeannaB,

Could you please clarify your use case a bit more? Is this because you want to use the same splits for multiple tests, or because you want to train on one set of samples (say, baseline timepoint) and classify later timepoints?

As a workaround, both are essentially possible now. In the latter case (explicitly choosing samples for training), though, you need to run the individual steps instead of the regress-samples pipeline. You could rig up a pipeline like this:

  1. use qiime feature-table filter-samples to select your training samples.
  2. same again to select the test samples. Of course, be very careful to make sure the train and test sets are completely exclusive, as manual selection is a risky approach!
  3. fit-regressor with the training table
  4. predict-regression with the test table
  5. scatterplot to visualize the results
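The exclusivity check in step 2 can be automated before any filtering is done. The following is only a minimal sketch, assuming the training and test sample IDs are already available as two Python lists; the function name `find_overlap` and the sample IDs are hypothetical and not part of QIIME 2.

```python
def find_overlap(train_ids, test_ids):
    """Return the sorted sample IDs that appear in both lists.

    An empty result means the two sets are mutually exclusive.
    """
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical sample IDs, for illustration only.
train_ids = ["S1", "S2", "S3", "S4"]
test_ids = ["S5", "S6"]

overlap = find_overlap(train_ids, test_ids)
if overlap:
    # Stop early rather than fit a model on leaked samples.
    raise ValueError(f"Samples in both train and test sets: {overlap}")
print("Train and test sets are exclusive.")
```

Running a check like this before steps 3 and 4 catches a sample that was accidentally listed in both sets.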

Would this fit your need?

Hi @Nicholas_Bokulich,

Thanks for your feedback. I think the suggested pipeline should work! However, I do agree that manual selection of samples for the test and training sets can be risky, if not done thoughtfully.

To clarify my use case:
I have repeated measures on the same individual through space, with samples taken at the same time. For example, I took two tissue samples per individual: 1 from a wound site, and 1 from a nearby but healthy site. I know from PERMANOVA analysis that the factor ‘individual’ accounts for ~30% of the variation in microbial beta diversity. Because these two samples (wound and healthy) from each individual are not independent, it might be best to split the training and test sets so that each individual falls into one group or the other but not both (e.g., it may be problematic if wound from individual A goes to training and healthy from individual A goes to test). This might be an overly conservative approach to the split, but it may prevent overestimation of model performance.
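A split of this kind, where every sample from an individual lands on the same side, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a QIIME 2 feature: the function `split_by_individual`, its parameters, and the sample and individual IDs are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_individual(sample_to_individual, test_fraction=0.2, seed=0):
    """Split sample IDs so that all samples from one individual stay together.

    sample_to_individual maps each sample ID to its individual ID.
    Returns (train_sample_ids, test_sample_ids), both sorted.
    """
    # Group the samples by the individual they came from.
    groups = defaultdict(list)
    for sample_id, individual in sample_to_individual.items():
        groups[individual].append(sample_id)

    # Shuffle individuals (not samples) with a fixed seed for reproducibility.
    individuals = sorted(groups)
    random.Random(seed).shuffle(individuals)

    # Reserve at least one individual for the test set.
    n_test = max(1, round(len(individuals) * test_fraction))
    test_individuals = set(individuals[:n_test])

    train, test = [], []
    for individual in individuals:
        target = test if individual in test_individuals else train
        target.extend(groups[individual])
    return sorted(train), sorted(test)


# Hypothetical data: five individuals, one wound and one healthy sample each.
samples = {
    f"{ind}_{site}": ind
    for ind in ["A", "B", "C", "D", "E"]
    for site in ["wound", "healthy"]
}
train_ids, test_ids = split_by_individual(samples, test_fraction=0.2, seed=42)
print(len(train_ids), "training samples;", len(test_ids), "test samples")
```

The two resulting ID lists could then be used to filter the feature table into training and test tables. scikit-learn's `GroupShuffleSplit` offers a similar group-aware split for anyone working outside the QIIME 2 pipelines.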

Thanks again for the suggested pipeline. I’ll give this a try. And, if you or others have additional insights on this issue, I welcome them.

Warm regards,
Deanna

Thanks for clarifying your use case @DeannaB !

This sort of issue (custom stratification) has been on my mind for a while, so hopefully in the (maybe relatively near?) future this can be accomplished automatically in the pipelines. I thought I had an open issue for this in the GitHub repository but I did not, so I opened one:

This would support automated stratification (e.g., by subject ID and treatment group), so there would be no need to specify samples manually. Hopefully I can get to this later this year, but if you are interested in contributing to the source code please let us know!

Glad to hear that the workaround should work for you! Let us know if you have any more questions.