Welcome to the forum, @George!
From the q2-longitudinal tutorial:
> A supervised learning regressor is used to identify important features and assess their ability to predict sample states.
The regressor used depends on what you select with `--p-estimator`; RandomForestRegressor is the default.
From the q2-sample-classifier tutorial:
> Another really useful output of supervised learning methods is feature selection, i.e., they report which features (e.g., ASVs or taxa) are most predictive. A list of all features, and their relative importances (or feature weights or model coefficients, depending on the learning model used), will be reported.... Features with higher importance scores were more useful.... Feature importance scores are assigned directly by the scikit-learn learning estimator that was used; more details on individual estimators and their importance scores should refer to the scikit-learn documentation. Note that some estimators — notably K-nearest neighbors models — do not report feature importance scores, so this output will be meaningless if you are using such an estimator.
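To make that last point concrete, here is a minimal scikit-learn sketch (a toy, not QIIME 2 itself, with made-up data) showing that importance scores live directly on the fitted estimator as `feature_importances_`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 5))                    # 100 samples, 5 hypothetical features
y = 3 * X[:, 2] + rng.normal(0, 0.1, 100)   # only feature 2 drives the target

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_

# Random forest importances sum to 1; the informative feature dominates.
print(importances.argmax())  # 2
```

A KNeighborsRegressor fit on the same data has no `feature_importances_` attribute at all, which is why that output is meaningless for such estimators.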
Imagine you're interested in trends in the flavors of ice cream. You run a longitudinal study of ingredients (features) using available flavors as the labels for your data set. Your dominant features are probably cream, sugar, egg, etc. They are the most abundant features, but give you little ability to predict trends in flavor production, because they show up in every sample.
Vanilla and rum are low-"frequency" features, but have much more predictive power. Of these, vanilla is less "important" than rum, because it is present in vanilla, chocolate, chocolate chip, butterscotch, and other flavors. Rum, on the other hand, is found only in Rum Raisin, and so is of high importance to a machine learning tool despite its low frequency.
This is metaphor, not science, but hopefully it communicates the important bit: abundance and predictive power are not necessarily linked, and different questions may prioritize different data. "Which ingredients tell us whether this is ice cream or creme brulee?" cares more about the ratios of high-abundance features like eggs and cream, while "Which features best predict which flavor of ice cream this is?" might show low-abundance features to be more impactful.
Features can be important regardless of whether their abundance is increasing or decreasing. Looking at "net avg change" alongside importance can help you tease out which important features are decreasing in abundance.
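A tiny sketch of that pairing, with entirely hypothetical IDs and values, just to show the bookkeeping:

```python
# Toy table pairing each feature's importance with its net average change.
features = [
    {"id": "ASV_1", "importance": 0.40, "net_avg_change": -12.0},
    {"id": "ASV_2", "importance": 0.35, "net_avg_change": 8.0},
    {"id": "ASV_3", "importance": 0.05, "net_avg_change": -30.0},
]

# Keep only the features the model found useful, then split by direction.
important = [f for f in features if f["importance"] >= 0.10]
decreasing = [f["id"] for f in important if f["net_avg_change"] < 0]
print(decreasing)  # ['ASV_1']
```

ASV_3 drops sharply but has negligible importance, while ASV_1 is both important and decreasing; that is the kind of feature this cross-check surfaces.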
One note - microbiome data is compositional, which confounds statements about changes in features' true abundance. Is the feature's abundance increasing, or is the apparent increase caused by decreases in the abundance of other features? Non-compositional methods can show useful trends, but if you plan to report on changes in feature abundance, you will want to consider compositionally-aware approaches (e.g., ANCOM).
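A two-sample toy example (hypothetical read counts) of that confound: feature A never changes in absolute terms, yet its relative abundance doubles simply because feature B collapses.

```python
# Feature A holds steady at 50 reads in both samples; only B changes.
sample_1 = {"A": 50, "B": 150}   # A is 25% of sample 1
sample_2 = {"A": 50, "B": 50}    # A is 50% of sample 2

def relative(sample):
    """Convert raw counts to relative abundances (what sequencing sees)."""
    total = sum(sample.values())
    return {k: v / total for k, v in sample.items()}

print(relative(sample_1)["A"])  # 0.25
print(relative(sample_2)["A"])  # 0.5
```

From relative abundances alone, A appears to increase, which is exactly the ambiguity compositionally-aware methods are designed to handle.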
You can copy-paste pictures into the text box, or use the upload button in the menu bar.