classify-samples features used in training

After running classify-samples I generate a few different artifacts, including:

  • feature_importance.qza
  • sample_estimator.qza
  • model_summary.qzv

Hopefully I'm correct in my interpretation that by exporting the feature_importance.qza artifact, I'm looking at the list of all features that the classifier used in its decision trees to generate the model. This is a subset of all my features, which makes sense given that model training was working on a subset of my samples.

I can also export the predictions.qza artifact to get a list of samples. I wanted to double-check that these are the samples that were withheld from model training and used instead to test the accuracy of the classifier. Is that correct?

What I'm looking for doesn't appear to be directly inside any of these artifacts: a list of the features (not samples) that would have been withheld from the model training. The reason for this is based on the plot below...

I was curious whether the important features identified by the classifier were generally ASVs that were either highly abundant or at least detected frequently among samples. This experiment had samples collected at two different locations (EN and HB), during three different months (June, July, September). I labeled the ASVs so that any feature identified as important to the classifier is a solid black circle, with its text label on the right-hand side of the plot; any ASV not identified by the classifier (that is, not listed in the feature_importance.qza artifact) is a square box, with its text label along the left edge of the plot.

For the most part, the points you see shifting around with respect to ASV occurrence (x axis) or sequence abundance (y axis) are nearly always those identified by the classifier (good!). But there are a few strange instances where a feature appears to be just as distinct, yet is marked by a square box, meaning it wasn't listed by the classifier (e.g., ASV-23|Helius).

That got me wondering whether that ASV is missing from the list not because the classifier considered it irrelevant, but because it occurred only in the subset of samples used for model accuracy testing (and not in the training subset). Is that possible?

My strategy at the moment is to take the list of samples in the predictions.qza file, infer from it which samples were used to train the model, and then check whether those curious ASVs not flagged by the classifier were absent from the training samples. My big question: is there a way to identify the ASVs that were used as inputs to model training, regardless of whether the classifier identified them as important?
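A minimal sketch of that strategy in plain Python. It assumes the samples listed in predictions.qza really are the held-out test samples (the thing I'm asking about above) and that the feature table has been exported into a per-sample dictionary of counts. The sample IDs, the counts, and the helper names `split_samples` and `features_present` are all made up for illustration.

```python
# Toy stand-in for an exported feature table: sample ID -> {feature ID: count}.
# All IDs and counts here are hypothetical.
table = {
    "S1": {"ASV-1": 120, "ASV-2": 0, "ASV-23": 0},
    "S2": {"ASV-1": 80, "ASV-2": 15, "ASV-23": 0},
    "S3": {"ASV-1": 0, "ASV-2": 40, "ASV-23": 55},
    "S4": {"ASV-1": 60, "ASV-2": 0, "ASV-23": 0},
}

# Sample IDs as they would appear in the exported predictions file
# (assumed here to be the test set).
test_samples = {"S3"}


def split_samples(table, test_samples):
    """Return (training IDs, test IDs) given the full table and the test IDs."""
    all_samples = set(table)
    test = all_samples & set(test_samples)
    return all_samples - test, test


def features_present(table, sample_ids):
    """Return features with a nonzero count in at least one of the samples."""
    return {
        feature
        for sample in sample_ids
        for feature, count in table[sample].items()
        if count > 0
    }


train_ids, test_ids = split_samples(table, test_samples)
train_features = features_present(table, train_ids)
test_only_features = features_present(table, test_ids) - train_features

print("training samples:", sorted(train_ids))
print("features seen only in test samples:", sorted(test_only_features))
```

With this toy data, ASV-23 has nonzero counts only in the test sample, which is the situation I suspect for the square-box ASVs in the plot.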

thanks for the help!



The ASVs used as inputs to model training would be all the features in the input feature table. If you want to see only those that were deemed unimportant, you can use filter-features to remove the important features from your feature table and ogle the rest.
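As a rough illustration of what that filtering amounts to, here is a sketch of the same set difference in plain Python. It assumes the feature IDs have already been exported from the feature table and from feature_importance.qza into two lists; the IDs and the helper name `unlisted_features` are hypothetical, and this is not how filter-features itself is implemented.

```python
def unlisted_features(table_feature_ids, importance_feature_ids):
    """Return feature IDs that are in the table but not in the importance list."""
    listed = set(importance_feature_ids)
    return sorted(set(table_feature_ids) - listed)


# Hypothetical IDs standing in for the exported feature table and
# the exported feature importance list.
table_feature_ids = ["ASV-1", "ASV-2", "ASV-23", "ASV-40"]
importance_feature_ids = ["ASV-1", "ASV-40"]

remaining = unlisted_features(table_feature_ids, importance_feature_ids)
print(remaining)
```

The IDs left over are the ones that were available to the classifier but did not end up in the importance list.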

It’s possible that the ASV occurred only in the test samples, but unlikely, since the majority of samples are used for training. On a related note, though, a feature that is not present in many samples may not be considered important simply because it is not especially informative of any given class; other features are ultimately more discriminative.
