feedback on alternative visualization for classify-samples output

As a follow-up to an earlier post, I've been trying to determine how best to visualize the results from a few classify-samples-ncv outputs. This experiment contains samples that were obtained from:

  • 2 locations ("EN" and "HB")
  • 3 dates (June, July, and September)

I want to show the readers a few things:

  1. How well does a given model perform? For example, is the classifier better at assigning samples to a collection date versus a location?
  2. What are the particularly important features when generating this model? Are these features the same across the different classifiers? For example, are the most important features the same in the classifier for Date as they are for Location?

I'm pretty clear on how to address the first objective. Where I'm stuck, and where I'd like some guidance/critique, is how to best answer the second one. More on this below...

I haven't yet used any of the downstream tools that work with classify-samples-ncv output, but it would be great to have pointers on which tools apply to the three output files (feature_importance.qza, predictions.qza, and probabilities.qza).

Instead, I've manually exported the output files and fiddled around with them in R to produce the plot shown below. To achieve my two goals stated above:

  1. Panels D-F show model performance by taking the predictions.qza file and generating a heatmap showing how often the prediction matched the actual group. The values inside each box represent the number of samples. Seems like these classifiers work really well at identifying when a sample was obtained (D), but aren't always perfect when it comes to classifying where a sample was obtained (E). When you consider both when and where (F), the classifier actually does better. I think/hope that is clear.

  2. Panels A-C are more subjective in my mind. I started by exporting the feature_importance.qza file, then ordering and filtering by relative importance: I gathered those features (OTUs in this case) whose importances summed to 50%. In other words, the data shown in each of these panels represent the OTUs that account for half of the model's 'importance' (however we define that); a rough command-line sketch of this filtering step follows the list. I wonder what users think: does it make sense to use a common % approach, rather than taking a fixed number of OTUs and explaining what proportion of "importance" they account for? And does 50% make any sense at all, versus something more strict like 20% or more inclusive like 80%? What values do others use? Perhaps there are other techniques that are useful for explaining which Features are most important to a classifier.
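Here's that sketch. I actually did this in R, but something like the following would do the same on the command line (file names are hypothetical, and the name of the TSV written by qiime tools export may differ across QIIME 2 releases):

  # Pull the per-feature importances out of the artifact
  qiime tools export \
    --input-path feature_importance.qza \
    --output-path importance_export

  # Sort features by importance (descending) and keep them until their
  # cumulative importance reaches 50%
  tail -n +2 importance_export/importance.tsv \
    | sort -t$'\t' -k2,2 -gr \
    | awk -F'\t' '{ cum += $2; print $1 } cum >= 0.5 { exit }' \
    > features_top50pct.txt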

Panels A-C are also an attempt to illustrate whether the same OTUs are important across classifiers. You can see that one group, the Trichoptera (teal color), is important in Panel B but not Panels A or C, while the Psocodea (orange color) are important in Panels A and C but not B. This is where I think my current % filtering approach suffers. It's possible that "important for X, but not for Y or Z" is entirely a function of filtering at that 50% threshold. If I change that to 60%, the story can change. If I change it to 20%, of course it changes again. That's why I wonder if there is another way to think about this kind of data.

Greatly appreciate your feedback and thoughts!


Hey @devonorourke,
Pretty plots!

predictions.qza and probabilities.qza can be used as input to confusion-matrix to generate the confusion matrices and ROC curves... though the confusion matrices you generated in R look great, so unless you want the ROC curves, running this will be redundant.
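For example (file names here are placeholders, and --i-probabilities is only needed if you want the ROC curves):

  qiime sample-classifier confusion-matrix \
    --i-predictions predictions.qza \
    --i-probabilities probabilities.qza \
    --m-truth-file sample-metadata.tsv \
    --m-truth-column Site \
    --o-visualization site-confusion-matrix.qzv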

Yep, there are many ways to do this... you can look at quartiles, a fixed threshold, a fixed number ("top 10"), etc. There is no "best"; it all depends on the shape of the data and/or researcher preferences.

You can also filter your feature table to contain only the top features and see how this impacts classification accuracy, and define a threshold dynamically (classify-samples can do this automatically for you with the --p-optimize-feature-selection option, but classify-samples-ncv does not).
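Something like this (metadata file and column names are placeholders; --output-dir just collects all of the outputs in one folder):

  qiime sample-classifier classify-samples \
    --i-table table.qza \
    --m-metadata-file sample-metadata.tsv \
    --m-metadata-column Site \
    --p-optimize-feature-selection \
    --output-dir classify-site-rfe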

Great! So these are markers for site and not month.

Yes, but the % importance is lower... so the message is the same: these features are more important for predicting site than month.

Instead of showing a barplot for each, you could create a heatmap showing the top features for panels D, E, and F together — that way you can unambiguously show how Trichoptera have a much higher relative importance for predicting site than month or month+site. (barplots sharing the y-axis could accomplish this as well but would be messy in my opinion)


Just to clarify here - are you suggesting that there is a single heatmap that combines all elements of D,E,F into a single image, or are there three separate heatmaps, one for D, one for E, and one for F?

Here's an example of one of the heatmaps generated by sample-classifier classify-samples when the --p-optimize-feature-selection parameter is turned on:

That heatmap would be representative of just Panel D in the image above (it's just for "Location"). The documentation suggests that you get the top 50 features. What I wasn't clear on was the gradient color - is that just the number of sequence counts? Is it the total number of sequences across all samples, or is it the median among all samples? Something else?

If I was to construct a heatmap of the top 50 features for D, E, and F individually, there's no guarantee that they would be the same 50 features, correct? Because those top 50 features would be different for each classifier model?

Maybe that wasn't what you were suggesting at all - apologies for the confusion. I like your suggestion to look at the core features and just grab those FeatureIDs. If I examine FeatureIDs present in at least 10% of the samples, that amounts to just 60 Features (and only 30 Features are present in at least 20% of samples). So somewhere in the 10-20% range will likely give me a good sense of which diet components are frequently identified but might still be discriminatory among Location or Date (or both). Does that sound sensible?
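For reference, this is roughly the command I used to get those prevalence numbers (file names are hypothetical; the min/max fraction and steps control which prevalence levels get tabulated):

  qiime feature-table core-features \
    --i-table table.qza \
    --p-min-fraction 0.1 \
    --p-max-fraction 1.0 \
    --p-steps 10 \
    --o-visualization core-features.qzv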

When I look at the data table output from the core-features analysis, I struggle to understand it. I think this question was already asked here. Here's a little bit of the table's output:

Feature ID      2%      9%      25%     50%     75%     91%     98%
feature1        0.0     0.0     5.5     272.0   1265.0  3478.4  5253.4
feature2        0.0     0.0     0.0     30.0    212.0   1171.4  7474.2
feature3        0.0     0.0     0.0     21.0    341.5   1298.2  3537.6

If I'm understanding your response to that other post correctly, the 2nd through Nth columns are the percentiles. But in particular, are these percentiles in terms of sequence counts? I couldn't find anything in the documentation that indicates what the values represent. Is it the case that these values are the number of sequences at the given percentile? For example, for feature1, the value at 50% indicates that half of my samples have more (or fewer) than 272 sequences assigned to that feature? And the value at the 75th percentile means that 75% of my samples have 1265 or fewer sequences for it?

Appreciate all the help - just want to be sure I know what I'm looking at here! :tired_face:

I am suggesting that you combine important features for D, E, F into a single visualization, whether a heatmap or some other visualization of your preference.

Also to clarify, I was suggesting a heatmap as a good visualization for this, but not qiime sample-classifier heatmap since there is no way to use that visualization to compare importances across multiple classifiers as I suggested. You would need a custom plot for this.

Looks like if "group-samples" is True, this is log10 frequency (sum of sequence counts).

Correct

I was not suggesting core features, really (and since these are presumably the same samples input to different classifiers, all features will be "core" because each feature will receive an importance score). I was just suggesting that you could group panels A-C into a single plot that shows the importance of these features across all 3 classifiers, making it clear that different features are more/less important for the different models, since my understanding of your original concern was that an arbitrary threshold could hide information.

I think your original figure is fine, and setting an arbitrary % threshold is fine too — and commonly done — so relating this to core-features is sort of going down a very deep rabbit hole!

Thanks for all the details @Nicholas_Bokulich,

I made a heatmap that follows up on an idea mentioned in this thread:

  • First, I identified the core features present in at least 10% of samples that have at least 10,000 sequences. I filtered to that minimum read depth to match the set of samples used in the other diversity metrics, where rarefied data were needed. There are 55 FeatureIDs present in at least 10% of these samples.
  • Next, I filtered the original data-table to retain only those 55 FeatureIDs among samples with at least 10,000 sequences.
  • Finally, I ran classify-samples-ncv on that reduced feature table (the commands are sketched below).
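Roughly, the commands looked like this (file, column, and directory names are placeholders; the list of 55 core FeatureIDs came from the core-features output described above):

  # Keep samples with at least 10,000 total reads
  qiime feature-table filter-samples \
    --i-table table.qza \
    --p-min-frequency 10000 \
    --o-filtered-table table-10k.qza

  # Keep only the 55 core FeatureIDs (one ID per line, with a header)
  qiime feature-table filter-features \
    --i-table table-10k.qza \
    --m-metadata-file core-feature-ids.tsv \
    --o-filtered-table table-10k-core.qza

  # Re-run nested cross-validation on the reduced table (one run per metadata column)
  qiime sample-classifier classify-samples-ncv \
    --i-table table-10k-core.qza \
    --m-metadata-file sample-metadata.tsv \
    --m-metadata-column SiteMonth \
    --output-dir ncv-core-sitemonth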

This visualization would replace panels A-C, as it shows the relative importance of the same set of Features across all three classifier models ("Site", "Month", and "SiteMonth"). I think this more clearly articulates that there are just a handful of relatively well-observed Features that are particularly important to some classifier models but not others.

Curious what you think of this strategy? I'm going to clean up the figure a bit so that I can color-group the Features into their respective arthropod Orders (need to do that with Adobe) - but the general setup will be the same.

Cheers!

yep, that heatmap captures what I had in mind: it shows many of the same top features as in your barplots, but it is easier to compare how important these same features are for each classification task.

Prevalence/abundance filtering like this is fine. The 10k sequence minimum threshold is very stringent... less abundant but prevalent sequences are very likely to be important for specific classes (e.g., sites), so I would personally try lower thresholds. But if you find that this does not damage performance, then that's informative in its own way, as you put it.

Maybe I'm misrepresenting these samples. I'm starting with samples whose per-sample total read counts range between about 2,000 and 120,000. The initial 10,000-read filter discarded 14 samples and retained 196 of the 210 samples, so I don't think the results would change dramatically if I lowered that threshold.

I did not require each core feature to have a minimum of 10,000 reads. It was simply that a sample had to have at least 10,000 reads to be considered in the core-features calculation.

Understood! That makes more sense, I misread before.

Thanks for the thoughts @Nicholas_Bokulich,

I've added another element above the new heatmap plot that shows the relative fraction of reads for a given OTU per sample, so that you can see how these relative abundances vary by Site or Month. By plotting each sample as a data point, you can also get a sense of how often a given OTU is detected among the samples in the experiment. Not sure if it adds much value, but I think it does show the underlying data contributing to the heatmap in a clear way.

The three heatmaps at the bottom present the same kind of data as in the initial figure at the beginning of this thread, though the values are slightly different because these machine learning predictions are based only on those core features, rather than the entire data set. Not a big change, but perhaps as you might expect, the classifier models do a bit better when you provide them with more possible OTUs to classify a sample to a location/date.
