Classify samples output - feature_importance.qza

(Devon O'rourke) #1

:raised_hand: Two questions about the output of the classify-samples script :raised_hand:

#1. The feature_importance.qza artifact produced after running classify-samples consists of two columns: the ASV (feature) ID and “importance”.

Could someone point me in the direction of what this means (from the docs)?:

--o-feature-importance
Importance of each input feature to model accuracy.

I have a sense that this is providing me with an indication of which ASVs are more important at discriminating between the group(s) when the model is being built, but I was hoping to have a better understanding of how any such value is actually derived.


#2. Does anyone have a sense of how important abundance information is? I’ve been playing with both rarefied and non-rarefied data and it seems like:
a. Rarefied data has more ASVs with larger Importance values (per ASV) … and as a result …
b. There are fewer ASVs to provide, say, something like 50% of the overall Importance

In the example plots below, there are 4 different groups that I was investigating (the horizontal facets); the same data were analyzed either rarefied or unrarefied. Same samples, same ASVs. What’s curious to me is how many more ASVs are part of the outcome in the unrarefied data; I’m struggling to understand why so few ASVs provide a high level of discrimination in the rarefied data yet not in the unrarefied data… except… well, apparently, for the last factor (“batch”).

Thanks for the tips!

1 Like

(Nicholas Bokulich) #2

Bingo

assuming you are using random forests: https://scikit-learn.org/stable/modules/ensemble.html#feature-importance
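For the curious, here is a minimal sketch (synthetic data, not QIIME 2 internals) of the score that link describes: scikit-learn’s impurity-based importance (mean decrease in Gini), averaged over all trees and normalized to sum to 1. The feature table here is a made-up stand-in for an ASV table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_features = 100, 20

# Synthetic "feature table": feature 0 carries the class signal,
# the remaining 19 are pure noise.
y = rng.integers(0, 2, size=n_samples)
X = rng.random((n_samples, n_features))
X[:, 0] += y  # shift feature 0 by the class label, making it discriminative

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds the impurity-based importances,
# averaged over trees and normalized to sum to 1.
importances = model.feature_importances_
print(importances.sum())     # ~1.0 (normalized)
print(importances.argmax())  # 0 — the informative feature ranks highest
```

A feature’s importance is high when splits on it produce large decreases in node impurity across the forest, which is why the discriminative feature above dominates the noise features.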

This is so subjective I cannot answer. The model decides for itself based on the input data.

Yep, you rarefy and you lose those low-abundance ASVs, leading to a shorter tail of importance scores. More % variation is explained by a smaller number of features pretty much just because you have fewer features… note that in rarefying you may be losing some explanatory features.
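A toy illustration of that point (hypothetical pure-noise data, not the original dataset): because impurity-based importances are normalized to sum to 1, spreading them over fewer features mechanically means fewer top-ranked features are needed to cover 50% of the total.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)

# Pure-noise feature tables: importance spreads roughly evenly, so the
# number of features covering 50% of it tracks the total feature count.
X_many = rng.random((200, 50))  # stand-in for an unrarefied table
X_few = X_many[:, :10]          # stand-in for a rarefied table with fewer ASVs

def n_features_for_half(X, y):
    """Count of top-ranked features needed to reach 50% cumulative importance."""
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    ranked = np.sort(model.feature_importances_)[::-1]
    return int(np.searchsorted(np.cumsum(ranked), 0.5) + 1)

print(n_features_for_half(X_many, y))  # more features needed with 50 columns
print(n_features_for_half(X_few, y))   # fewer needed with 10 columns
```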

I’d say stop worrying and just trust the machine :robot:

0 Likes

(Devon O'rourke) #3

@Nicholas_Bokulich
+1 :clap: for the help
+2 :clap:s for the program!

I would note that the plot above has nearly identical samples and ASVs. There are about 2500 ASVs and about 280 samples that were input to the sample classifier. The rarefied reads had 7 fewer samples and 84 fewer ASVs (and those ASVs represent just 0.03% of the total dataset read abundance).

What’s odd to me is how many more ASVs appear “important” at all in the unrarefied data. I wonder if the reason is that rarefying the data is basically flattening out the low-abundance counts to similar values… making the variance in abundance much smaller, especially for low-abundance ASVs.
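One way to sanity-check that flattening intuition (a hypothetical simulation, not the actual dataset): take a rare ASV observed across samples with very different library sizes, subsample every sample to a common depth, and compare the across-sample variance before and after. Binomial draws are used here as an approximation to rarefaction’s subsampling without replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts for one rare ASV (true relative abundance 0.1%) across 100
# samples whose library sizes vary widely, so raw counts vary widely too.
depths = rng.integers(5_000, 50_000, size=100)
rare_frac = 0.001
raw = rng.binomial(depths, rare_frac)

# "Rarefy" every sample to a common depth of 5,000 reads, approximated
# by a binomial draw at each sample's observed relative abundance.
rarefied = rng.binomial(5_000, raw / depths)

print(raw.var(), rarefied.var())  # rarefied variance is much smaller
```

Equalizing the depth removes the library-size component of the variance, which is exactly the flattening effect described above.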

The good news is that the ASVs at the top of the list are typically all the same whether the data is rarefied or not.

Thanks again!

0 Likes

(Nicholas Bokulich) #4

that is really what counts for your interpretation at the end of the day, and I would focus on that information.

:robot::robot::robot::robot::robot::robot:

0 Likes

(Devon O'rourke) #5

So many machines… :slot_machine:, :robot: , and Bert Kreischer.

Thanks for the help (as usual!) @Nicholas_Bokulich

1 Like