Hi there,
Is there a way to find out, from the sample classifier output, how many samples were used to train the model?
I am aware that the number of samples used to evaluate the model can be obtained from the predictions.qza artifact.
However, it would be useful to have the same information for the samples that go into the algorithm for training, especially for subsequent reporting.
Hi @Deni_Ribicic ,
This info can be obtained in a few indirect ways:
- number of training samples = total number of samples - number of evaluation samples
- number of training samples = total number of samples * (1 - test_size)
- if using a classifier, count the number of samples in the training_targets output.
- if using one of the nested classifiers, all samples are used for training in n-1 folds and for testing in 1 fold. The number of training samples per fold is calculated as in the second option above.
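For reporting, the first two options can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the second function assumes the split rounds the test set up to a whole sample (which I believe is how scikit-learn's train_test_split handles a fractional test_size), so check the result against your predictions.qza if the exact count matters.

```python
import math


def training_count_from_evaluation(total_samples, evaluation_samples):
    """Option 1: training samples = total samples - evaluation samples."""
    return total_samples - evaluation_samples


def training_count_from_test_size(total_samples, test_size):
    """Option 2: training samples = total * (1 - test_size).

    Assumption: the test set is rounded up to a whole number of samples,
    and the training set is whatever remains.
    """
    n_test = math.ceil(test_size * total_samples)
    return total_samples - n_test


print(training_count_from_evaluation(100, 25))   # 75
print(training_count_from_test_size(100, 0.25))  # 75
print(training_count_from_test_size(10, 0.25))   # 7 (test set rounds up to 3)
```

The last call shows why the two options can differ from a naive total * (1 - test_size) when the product is not a whole number.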
This info could also simply be added to the model_summary outputs, if you want a fast and explicit place to look and are interested in contributing to the source code.
It would be impossible to register this information in provenance directly (since it is not set by a separate parameter), but the test_size parameter is how this information is effectively stored in provenance (assuming that the total number of samples is also known).
Hi @Nicholas_Bokulich ,
Thanks for the input.
I am ashamed to admit I haven't updated qiime2 for a while, hence I did not have training_targets as a standard output from the run. This does indeed solve my problem.
But just out of curiosity, how would one obtain the same info when predicting numerical (continuous) data with regress-samples? (I can see there is no training_targets output there.) Let's say I have a huge dataset (100s of samples) with some metadata values missing, so it becomes more tedious to track how many samples in total go into the model (and therefore your suggestion #1 wouldn't be optimal, I assume).
Point number 1 or 2 above.
Agreed, this is less straightforward with missing values, but it is still traceable for reporting purposes, provided that you have not since edited the metadata file to replace some values. I am not sure why regress_samples does not output training_targets... this could probably be corrected easily enough.
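If counting by hand is tedious, a small script can do the tracing. The sketch below is an assumption-laden example, not part of q2-sample-classifier: the function name, the column name "ph", and the sample IDs are all hypothetical. It assumes a tab-separated metadata file whose first column holds the sample IDs, that empty cells mean missing values, and that only samples present in both the metadata and the feature table go into the model.

```python
import csv
import io


def count_usable_samples(metadata_tsv, target_column, feature_table_ids):
    """Count samples that have a target value and are in the feature table."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    id_column = reader.fieldnames[0]  # assumed: first column is the sample ID
    usable = 0
    for row in reader:
        sample_id = row[id_column]
        # Skip comment/directive rows such as a "#q2:types" line.
        if sample_id.startswith("#"):
            continue
        value = (row[target_column] or "").strip()
        if value and sample_id in feature_table_ids:
            usable += 1
    return usable


# Hypothetical metadata: S2 is missing its value, S4 is not in the table.
metadata = "sample-id\tph\nS1\t6.5\nS2\t\nS3\t7.1\nS4\t5.9\n"
table_ids = {"S1", "S2", "S3"}

print(count_usable_samples(metadata, "ph", table_ids))  # 2
```

The resulting count is the "total number of samples" to plug into point 1 or 2 above.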
Hi @Deni_Ribicic ,
Quick update: I opened an issue for this on GitHub, so this functionality will be added in a future release of q2-sample-classifier. You can watch that issue on GitHub if you are interested in tracking progress on this (and, as always, contributions to the source code are welcome).
Thanks for bringing this up!