sample-classifier under the hood

kam · December 31, 2024, 6:20pm

Hi!

I’m trying to gain a better understanding of how the classify_samples command from the sample_classifier plugin works. I understand that under the hood, this plugin primarily wraps the functionality of sklearn, and I’m trying to figure out how equivalent code would look in sklearn (perhaps there’s source code I haven’t found yet).

Let’s say I pick the default RandomForest. I assume that training the model is straightforward: rf = RandomForestClassifier() followed by rf.fit(X_train, y_train), with rf.predict(X_test) for testing.
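To make that assumption concrete, here is a minimal sketch in plain sklearn. It uses synthetic data as a stand-in for a feature table and a metadata column, and the split size, n_estimators, and random_state values are arbitrary choices for illustration, not values taken from the plugin's source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a samples x features table and a class label per sample.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hold out 20% of samples for testing (an assumed fraction, for illustration only).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training samples, then predict the held-out samples.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Accuracy on the held-out samples only.
test_accuracy = accuracy_score(y_test, y_pred)
print(f"held-out accuracy: {test_accuracy:.3f}")
```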

What I’m unsure about is how the --p-cv parameter is used. Is it applied repeatedly, i.e., as a for-loop that runs once per fold for the specified number of folds? Additionally, how are the reported Model Accuracy and AUC values calculated? Are they the mean or median across the k folds of the cross-validation?
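For reference, this is what the "loop over k folds and average" reading would look like in plain sklearn. It is only a sketch of that hypothesis, not a claim about what classify_samples actually does; the variable cv is a hypothetical stand-in for the value passed to --p-cv, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary-classification data standing in for a real feature table.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

cv = 5  # hypothetical stand-in for the --p-cv value

rf = RandomForestClassifier(n_estimators=100, random_state=0)
folds = StratifiedKFold(n_splits=cv, shuffle=True, random_state=0)

# cross_validate runs the per-fold fit/score loop internally:
# one model is trained and scored for each of the cv folds.
scores = cross_validate(rf, X, y, cv=folds, scoring=("accuracy", "roc_auc"))

fold_accuracy = scores["test_accuracy"]
fold_auc = scores["test_roc_auc"]

# Under this reading, the summary would be an average over the folds.
print(f"mean accuracy over {cv} folds: {fold_accuracy.mean():.3f}")
print(f"mean ROC AUC over {cv} folds: {fold_auc.mean():.3f}")
```

Whether the plugin reports fold averages like this, or instead scores a separate held-out test set and uses the folds for something else, is exactly the open question here.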

Thanks!

colinbrislawn · December 31, 2024, 9:28pm

Here's the source code!

Once everyone is back after New Year's Day, we can dive into more details. Looks like Nick wrote most of this in 2018-19.