feedback on alternative visualization for classify-sample output

devonorourke · August 21, 2020, 8:24pm

Just to clarify here - are you suggesting that there is a single heatmap that combines all elements of D,E,F into a single image, or are there three separate heatmaps, one for D, one for E, and one for F?

Here's an example of one of the heatmaps generated by sample-classifier classify-samples with the --p-optimize-feature-selection parameter turned on:
feature-table-heatmap

That heatmap would be representative of just Panel D in the image above (it's just for "Location"). The documentation suggests that you get the top 50 features. What I wasn't clear on was the color gradient - is that just the sequence count? Is it the total number of sequences across all samples, or is it the median among all samples? Something else?

If I were to construct a heatmap of the top 50 features for D, E, and F individually, there's no guarantee that they would be the same 50 features, correct? Because the top 50 features could be different for each classifier model?
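To make that concern concrete, here's a minimal sketch of what I mean. The feature names and importance scores are made up for illustration (they are not from my models), and `top_k` is just a hypothetical helper, not anything from q2-sample-classifier:

```python
def top_k(importances, k):
    """Return the set of k feature IDs with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])

# Hypothetical importance scores from two separately trained classifiers.
location = {"featA": 0.40, "featB": 0.30, "featC": 0.20, "featD": 0.10}
date = {"featA": 0.35, "featB": 0.05, "featC": 0.15, "featD": 0.45}

top_location = top_k(location, 2)
top_date = top_k(date, 2)

# Only the features ranked highly by both models appear in the overlap.
shared = top_location & top_date
print(sorted(shared))
```

In this toy case only one of the two top features is shared, which is the situation I'd expect when comparing the top 50 for Location against the top 50 for Date.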

Maybe that wasn't what you were suggesting at all - apologies for the confusion. I like your suggestion to look at the core features, and just grab from those FeatureIDs. If I restrict to FeatureIDs present in at least 10% of the samples, that leaves just 60 Features (and just 30 Features are present in at least 20% of samples). So a cutoff somewhere in the 10-20% range will likely give me a good sense of which diet components are frequently identified, but might still be discriminatory among Location or Date (or both). Does that sound sensible?
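In case it helps to show what I mean by the prevalence cutoff, here's a rough sketch on a toy table. The counts are invented and `features_by_prevalence` is a hypothetical helper of my own, assuming "present" simply means a count greater than zero in a sample:

```python
# Toy per-sample counts for three made-up features across 10 samples.
counts = {
    "featA": [10, 0, 3, 7, 0, 2, 0, 5, 1, 4],
    "featB": [0, 0, 0, 0, 0, 0, 0, 0, 0, 12],
    "featC": [0, 5, 0, 0, 0, 0, 9, 0, 0, 0],
}

def features_by_prevalence(counts, min_fraction):
    """Return feature IDs observed (count > 0) in at least min_fraction of samples."""
    kept = []
    for feature_id, per_sample in counts.items():
        n_present = sum(1 for c in per_sample if c > 0)
        if n_present / len(per_sample) >= min_fraction:
            kept.append(feature_id)
    return sorted(kept)

at_10 = features_by_prevalence(counts, 0.10)
at_20 = features_by_prevalence(counts, 0.20)
print(len(at_10), len(at_20))
```

Raising the cutoff from 10% to 20% drops the feature seen in only one sample, which mirrors how my list shrinks from 60 to 30 Features.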

When I look at the data table output from the core-features analysis, though, I struggle to understand it. I think this question was already asked here. Here's a little bit of the table's output:

Feature ID      2%      9%      25%     50%     75%     91%     98%
feature1        0.0     0.0     5.5     272.0   1265.0  3478.4  5253.4
feature2        0.0     0.0     0.0     30.0    212.0   1171.4  7474.2
feature3        0.0     0.0     0.0     21.0    341.5   1298.2  3537.6

If I'm understanding your response to that other post correctly, the 2nd through Nth columns are the percentiles. But in particular, are these percentiles in terms of sequence counts? I couldn't find anything in the documentation that indicates what the values represent. Is it the case that these values are the median number of sequences at that given percentile? For example, for feature1, does the value at 50% indicate that half of my samples have more (and half have fewer) than 272 sequences? And at the 75th percentile (that is, the upper quarter of samples with the greatest number of sequences), is the median value 1265 sequences?
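To check my own reading, here's a small sketch of how I'd compute such a row if the columns were plain percentiles of a feature's per-sample counts. That is an assumption on my part, not something I found in the documentation. The counts are made up, and the `percentile` helper uses linear interpolation between ranks, which is one common convention and may not be what core-features does:

```python
import math

def percentile(values, p):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Made-up counts of one feature in five samples (not my real data).
feature_counts = [0, 212, 30, 1000, 0]

p25 = percentile(feature_counts, 25)
p50 = percentile(feature_counts, 50)
p75 = percentile(feature_counts, 75)
print(p25, p50, p75)
```

Under that reading, the 75% column would be the count that roughly three quarters of samples fall at or below, rather than a median taken within the upper quarter of samples. Interpolation would also explain non-integer entries like 5.5 in my table.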

Appreciate all the help - just want to be sure I know what I'm looking at here!