My question is about Random Forest model development using sample-classifier. As I understand this tool, it trains the decision trees on a subset of the data and then applies the trees to 100% of the data to determine accuracy. Is this correct, and if so, what percentage of the data are used for the training set?
Incorrect. The data are split into separate training and test sets, so accuracy is estimated only on the test set, which the model never sees during training.
See the --p-test-size parameter, which controls the proportion of samples held out for testing.
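For example, here is a minimal command-line sketch (the table, metadata file, and column names are placeholders for your own data, and parameter names follow the current CLI, so check --help on your release):

    # Hold out 20% of samples for testing (the default);
    # accuracy is estimated only on that held-out set.
    # --p-random-state just fixes the split for reproducibility.
    qiime sample-classifier classify-samples \
      --i-table table.qza \
      --m-metadata-file sample-metadata.tsv \
      --m-metadata-column body-site \
      --p-estimator RandomForestClassifier \
      --p-test-size 0.2 \
      --p-random-state 123 \
      --output-dir rf-results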
If you want to obtain classifications of all samples, see the classify-samples-ncv method. It does not produce a classifier that you can save and re-use, but it classifies all samples through nested cross-validation: essentially, it performs classify-samples N times, so a different classifier is trained each time and each sample lands in the test set exactly once and in the training set N-1 times.
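A sketch of the equivalent nested cross-validation run with the same placeholder inputs (I believe --p-cv sets the number of folds N, with a default of 5, but check --help to confirm):

    # Every sample receives a prediction; the per-sample
    # predictions are written as an artifact in the output directory.
    qiime sample-classifier classify-samples-ncv \
      --i-table table.qza \
      --m-metadata-file sample-metadata.tsv \
      --m-metadata-column body-site \
      --p-estimator RandomForestClassifier \
      --p-cv 5 \
      --output-dir ncv-results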
Thanks for the quick response and the clarification! It looks like the --p-test-size default value is 0.2, meaning 80% of the data are used for training and 20% for testing, correct?
classify-samples-ncv seems like a great alternative for including all samples in the classifier; I will look into it. Thanks again!
That method is new in this release and has not been documented yet. More documentation is coming in next month's release, but its use is similar to classify-samples. Note, however, that accuracy for that method is reported on stdout (i.e., printed to the terminal), so use the --verbose flag if you are running it from the command line. We do not yet have a method for visualizing the accuracy of that method (it did not make it into last month's release), so keep an eye out for next month's release, which will have lots of goodies in store for all of sample-classifier.
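Until a visualizer exists, one way to keep a record of those accuracy scores is to mirror stdout to a file with tee (standard shell; the log file name is arbitrary, and the inputs below are the same placeholders as above):

    # --verbose prints the accuracy report to the terminal;
    # tee saves a copy next to the output artifacts.
    qiime sample-classifier classify-samples-ncv \
      --i-table table.qza \
      --m-metadata-file sample-metadata.tsv \
      --m-metadata-column body-site \
      --p-estimator RandomForestClassifier \
      --p-cv 5 \
      --output-dir ncv-results \
      --verbose | tee ncv-accuracy.log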