Metadata sample classifier: information on number of samples for training purposes

Deni_Ribicic · May 15, 2022, 7:37pm

Hi there,

Is there a way to get the number of samples used for training the model from the sample classifier output?

I am aware that the number of samples used for evaluating the model can be obtained from the predictions.qza artifact.
However, it would also be beneficial to have the number of samples that went into the algorithm for training, especially for subsequent reporting.

Nicholas_Bokulich · May 16, 2022, 11:00am

Hi @Deni_Ribicic ,
This info could be obtained in a few indirect ways:

1. number of training samples = total number of samples - number of evaluation samples
2. number of training samples = total number of samples * (1 - test_size)
3. if using a classifier, count the number of samples in the training_targets output.
4. if using one of the nested classifiers, all samples are used for training in n-1 folds and for testing in 1 fold. The number of training samples per fold can be calculated as in step #2 above.
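
The arithmetic in points 1, 2, and 4 can be sketched in plain Python. This is only an illustration, not q2-sample-classifier code: the function names are hypothetical, and it assumes that the test set is rounded up to a whole sample (as scikit-learn's train_test_split does for a fractional test_size) and that cross-validation test folds differ in size by at most one sample.

```python
import math


def training_count_from_evaluation(n_total, n_evaluation):
    """Point 1: training samples = total samples - evaluation samples."""
    if not 0 <= n_evaluation <= n_total:
        raise ValueError("n_evaluation must be between 0 and n_total")
    return n_total - n_evaluation


def training_count_from_test_size(n_total, test_size):
    """Point 2: training samples = total * (1 - test_size).

    Assumption: the test set is rounded up to a whole sample, so the
    training set is whatever remains.
    """
    if not 0 <= test_size < 1:
        raise ValueError("test_size must be in [0, 1)")
    n_test = math.ceil(n_total * test_size)
    return n_total - n_test


def training_counts_per_fold(n_total, n_folds):
    """Point 4: each fold trains on every sample outside its test fold.

    Assumption: test folds differ in size by at most one sample.
    """
    base, extra = divmod(n_total, n_folds)
    test_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
    return [n_total - n_test for n_test in test_sizes]


if __name__ == "__main__":
    print(training_count_from_evaluation(120, 30))   # 90
    print(training_count_from_test_size(120, 0.25))  # 90
    print(training_counts_per_fold(103, 5))          # [82, 82, 82, 83, 83]
```

Points 1 and 2 should agree for the same run; if they do not, the "total number of samples" is probably not what the model actually received (for example, because some samples were filtered out beforehand).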

This info could also simply be added to the model_summary outputs, if you want a fast and explicit place to look and are interested in contributing to the source code.

It would be impossible to register this information in provenance directly (since it is not set by a separate parameter), but the test_size parameter is how this information is effectively stored in provenance (assuming that the total number of samples is also known).

Deni_Ribicic · May 16, 2022, 1:55pm

Hi @Nicholas_Bokulich ,

Thanks for the input.

I am ashamed to admit I haven't updated qiime2 for a while, hence I did not have training_targets as a standard output from the run. This does indeed solve my problem.

But just out of curiosity, how would one obtain the same info when predicting numerical (continuous) data with regress-samples? (I can see there is no training_targets output here.) Let's say one has a huge dataset (100s of samples) with some values missing from the metadata, so it becomes more tedious to track how many samples in total are going into the model (therefore your suggestion #1 wouldn't be optimal, I assume).

Nicholas_Bokulich · May 16, 2022, 6:46pm

Point number 1 or 2 above.

Agreed, this is less straightforward with missing values but still traceable for reporting purposes, provided that you did not update the metadata file to replace some values. I am not sure why regress_samples does not output training_targets... this could probably be corrected easily enough.
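
For the missing-values case, one way to get the total is to count the samples that appear in the feature table and also have a value in the target metadata column. The sketch below uses only the Python standard library. The function name, the example column `ph`, and the sample IDs are made up for illustration, and it assumes that an empty cell marks a missing value and that samples without a target value are dropped before the train/test split.

```python
import csv
import io


def count_usable_samples(metadata_tsv, column, table_sample_ids, missing=("",)):
    """Count samples that are in the feature table and have a target value.

    metadata_tsv: text of a tab-separated metadata file whose first column
        holds the sample IDs.
    column: name of the metadata column being predicted.
    table_sample_ids: sample IDs present in the feature table.
    missing: cell values treated as missing (assumption: empty cells only).
    """
    reader = csv.reader(io.StringIO(metadata_tsv), delimiter="\t")
    header = next(reader)
    col_idx = header.index(column)
    table_ids = set(table_sample_ids)
    usable = 0
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/directive rows
        value = row[col_idx].strip() if col_idx < len(row) else ""
        if row[0] in table_ids and value not in missing:
            usable += 1
    return usable


# Hypothetical metadata: s2 has no pH value.
EXAMPLE_METADATA = (
    "sample-id\tph\n"
    "#q2:types\tnumeric\n"
    "s1\t6.5\n"
    "s2\t\n"
    "s3\t7.1\n"
    "s4\t5.9\n"
)

if __name__ == "__main__":
    # s2 is missing a value and s4 is absent from the feature table.
    print(count_usable_samples(EXAMPLE_METADATA, "ph", ["s1", "s2", "s3"]))  # 2
```

The resulting count is the "total number of samples" to plug into point 1 or 2.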

Nicholas_Bokulich · May 17, 2022, 9:01am

Hi @Deni_Ribicic ,

Quick update: I opened an issue for this on GitHub, so this functionality will be added in a future release of q2-sample-classifier. You can watch that issue on GitHub if you are interested in tracking progress on this (and, as always, contributions to the source code are welcome).

Thanks for bringing this up!
