Metadata sample classifier: information on the number of samples used for training

Hi there,

Is there a way to get information from the sample classifier output on the number of samples used for training the model?

I am aware that it is possible to obtain the number of samples used for evaluating the model; it is in the predictions.qza artifact.
However, it would be beneficial to have the same information for the number of samples going into the algorithm for training, especially for subsequent reporting.

Hi @Deni_Ribicic ,
This info could be obtained in a few indirect ways:

  1. number of training samples = total number of samples - number of evaluation samples
  2. number of training samples = total number of samples * (1 - test_size)
  3. if using a classifier, count the number of samples in the training_targets output.
  4. if using one of the nested classifiers, all samples are used for training in n-1 folds and for testing in 1 fold. The number of training samples per fold is calculated as in item 2 above.
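For reporting, items 1 and 2 can be sketched in a few lines of Python. This is only an illustration: `estimate_split_sizes` is a hypothetical helper, not part of q2-sample-classifier, and it assumes the test set size is rounded up to a whole sample (as scikit-learn's `train_test_split` does), so the training count can differ by one from a plain `total * (1 - test_size)`.

```python
import math


def estimate_split_sizes(n_total, test_size):
    """Estimate (n_train, n_test) for a single train/test split.

    Assumption: the test set is rounded up to a whole sample and the
    training set receives the remainder.
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a fraction between 0 and 1")
    n_test = math.ceil(test_size * n_total)
    n_train = n_total - n_test
    return n_train, n_test


# Hypothetical example: 103 samples with a test_size of 0.2
n_train, n_test = estimate_split_sizes(103, 0.2)
print(f"training samples: {n_train}, evaluation samples: {n_test}")
```

If the evaluation count from predictions.qza is known, comparing it with `n_test` here is a quick sanity check on the total number of samples.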

This info could also be simply added to the model_summary outputs if you want a fast + explicit place to look, and are interested in contributing to the source code :wink:

It would be impossible to register this information in provenance directly (since it is not set by a separate parameter), but the test_size parameter is how this information is effectively stored in provenance (assuming that the total number of samples is also known).

Hi @Nicholas_Bokulich ,

Thanks for the input.

I am ashamed to admit I haven't updated qiime2 for a while :flushed:, hence did not have training_targets as standard output from the run. This solves my problem indeed.

But just out of curiosity, how would one obtain the same info when predicting numerical (continuous) data with regress-samples? I can see there is no training_targets output there. Say I have a large dataset (100s of samples) with some metadata values missing: it becomes tedious to track how many samples in total go into the model, so I assume your suggestion #1 wouldn't be optimal.


Use point number 1 or 2 above.

Agreed, this is less straightforward with missing values but still traceable for reporting purposes, provided that you did not update the metadata file to replace some values. I am not sure why regress_samples does not output training_targets... this could probably be corrected easily enough.
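A rough way to get the total when some metadata values are missing is to count the samples that have a value in the target column and are also present in the feature table. The sketch below uses pandas on a made-up metadata table; the column name `ph`, the sample IDs, and the helper `count_usable_samples` are all hypothetical, and it assumes that samples with a missing target value are dropped before the train/test split.

```python
import io

import pandas as pd

# Made-up metadata: s2 and s5 have no value in the target column.
metadata_tsv = io.StringIO(
    "sample-id\tph\n"
    "s1\t6.5\n"
    "s2\t\n"
    "s3\t7.1\n"
    "s4\t5.9\n"
    "s5\t\n"
)


def count_usable_samples(metadata, column, table_sample_ids):
    """Count samples with a non-missing target value that are in the table."""
    values = metadata[column].dropna()
    return len(values.index.intersection(table_sample_ids))


metadata = pd.read_csv(metadata_tsv, sep="\t", index_col="sample-id")
table_sample_ids = ["s1", "s2", "s3", "s4"]  # IDs present in the feature table

n_usable = count_usable_samples(metadata, "ph", table_sample_ids)
print(f"samples going into the model: {n_usable}")
```

The resulting count is the "total number of samples" to plug into point 1 or 2.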

Hi @Deni_Ribicic ,

Quick update: I opened an issue for this on GitHub, so this functionality will be added in a future release of q2-sample-classifier. You can watch that issue on GitHub if you are interested in tracking progress on this (and as always contributions to the source code are welcome :wink: ).

Thanks for bringing this up!
