Is there a way to get the number of samples used for training the model from the sample classifier output?
I am aware that the number of samples used for evaluating the model can be obtained from the predictions.qza artifact.
However, it would also be useful to have the number of samples that went into training the algorithm, especially for subsequent reporting.
Hi @Deni_Ribicic,
This info could be obtained in a few indirect ways:
1. number of training samples = total number of samples - number of evaluation samples
2. number of training samples = total number of samples * (1 - test_size)
3. if using a classifier, count the number of samples in the training_targets output.
4. if using one of the nested classifiers, all samples are used for training in n-1 folds and for testing in 1 fold. The number of training samples per fold is calculated as in step #2 above.
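The arithmetic in the first two options can be sketched in a few lines of Python. This is only an illustration, not code from q2-sample-classifier: the function names are hypothetical, and the rounding assumes the scikit-learn convention for a fractional test_size (the test set is rounded up and the training set gets the remainder). Check the count against predictions.qza if the exact number matters.

```python
import math


def train_count_from_eval(n_total, n_eval):
    """Option 1: training samples = total samples - evaluation samples."""
    if n_eval > n_total:
        raise ValueError("n_eval cannot exceed n_total")
    return n_total - n_eval


def split_counts(n_total, test_size):
    """Option 2: return (n_train, n_test) for a given test_size fraction.

    Assumption: the test set size is rounded up, as scikit-learn does for
    a fractional test_size, and the training set receives the remainder.
    """
    if not 0.0 <= test_size < 1.0:
        raise ValueError("test_size must be in the range [0, 1)")
    n_test = math.ceil(n_total * test_size)
    n_train = n_total - n_test
    return n_train, n_test


if __name__ == "__main__":
    n_train, n_test = split_counts(100, 0.2)
    print(f"training: {n_train}, evaluation: {n_test}")
```

When the total is not evenly divisible, the two options agree only if the same rounding is applied, so option 1 (using the actual number of rows in predictions.qza) is the safer figure to report.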
This info could also simply be added to the model_summary outputs, if you want a fast and explicit place to look and are interested in contributing to the source code.
It would be impossible to register this information in provenance directly (since it is not set by a separate parameter), but the test_size parameter is how this information is effectively stored in provenance (assuming that the total number of samples is also known).
I am ashamed to admit I haven't updated qiime2 for a while, hence I did not have training_targets as a standard output from the run. This indeed solves my problem.
But just out of curiosity, how would one obtain the same info when predicting numerical (continuous) data with regress-samples? (I can see there is no training_targets output there.) Say you have a huge dataset (100s of samples) with some metadata values missing: it becomes tedious to track how many samples in total go into the model, so I assume your suggestion #1 wouldn't be optimal.
Agreed, this is less straightforward with missing values but still traceable for reporting purposes, provided that you did not update the metadata file to replace some values. I am not sure why regress_samples does not output training_targets... this could probably be corrected easily enough.
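In the meantime, the total can be traced by hand. Here is a minimal sketch, assuming that the samples going into the model are those present in both the feature table and the metadata with a non-missing value in the target column. That assumption, the function name, and the list of missing-value markers are all hypothetical, so adjust them to match your own files.

```python
def count_usable_samples(table_ids, target_by_sample,
                         missing_markers=("", "NA", "NaN", "nan", None)):
    """Count samples that are in the feature table AND have a target value.

    table_ids: iterable of sample IDs found in the feature table.
    target_by_sample: dict mapping sample ID -> value of the target column.
    missing_markers: values treated as missing (an assumption; edit as needed).
    """
    usable = {
        sample_id
        for sample_id in set(table_ids)
        if sample_id in target_by_sample
        and target_by_sample[sample_id] not in missing_markers
    }
    return len(usable)


if __name__ == "__main__":
    # Toy data: s2 has a missing value, s4 is absent from the metadata,
    # and s5 is absent from the feature table.
    table_ids = ["s1", "s2", "s3", "s4"]
    targets = {"s1": "4.2", "s2": "", "s3": "7.1", "s5": "3.3"}
    print(count_usable_samples(table_ids, targets))
```

The resulting count is the "total number of samples" to plug into total * (1 - test_size).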
Quick update: I opened an issue for this on GitHub, so this functionality will be added in a future release of q2-sample-classifier. You can watch that issue on GitHub if you are interested in tracking progress on this (and, as always, contributions to the source code are welcome).