I have been using different regression techniques to make predictions based on a taxonomy feature table. My approach is to:
- Divide my entire dataset in half and keep one half "sequestered" as the unknown set
- Train and save a regression model on the known set
- Test the model on the unknown set
With random forest regression, the model's ability to make predictions on the unknown set isn't great, but at least it's sensible. Here is a picture of a representative scatterplot of predictions made on the unknown set after building the model with the known set:
However, when I try using some of the other regression modules (e.g., LinearSVR), the predictions that come out when I feed the model my unknown set are wildly inappropriate. Below is the corresponding scatterplot for LinearSVR. The negative predictions don't make sense to me, since the values in the entire dataset are all positive integers between 0 and 100.
My question is: is there a way within QIIME 2 to scale or otherwise constrain these regression models so that the predictions fall within some pre-specified bounds? Many thanks.
You are correct: this is a scaling issue. Linear SVMs and some other models are very sensitive to feature scale, and the data should be standardized before fitting for these models to produce sensible predictions on some datasets. The good news is that most of the available estimators are not scale-sensitive and can be used in the current version.
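For anyone who wants to work around this outside of QIIME 2 in the meantime, here is a minimal scikit-learn sketch of the fix being described: standardizing features before fitting LinearSVR, with the scaler and model chained in a pipeline so the unknown set is transformed consistently at predict time. The feature matrix and target below are hypothetical stand-ins for a taxonomy feature table, not data from this thread.

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a taxonomy feature table: 200 samples x 50
# features, with a target bounded between 0 and 100 as in the question.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = np.clip(X[:, 0] * 100 + rng.normal(0, 5, 200), 0, 100)

# Hold out half the data as the "unknown" set, as described above.
X_known, X_unknown, y_known, y_unknown = train_test_split(
    X, y, test_size=0.5, random_state=0)

# StandardScaler rescales each feature to zero mean / unit variance on the
# known set; the pipeline applies the same transform before prediction.
model = make_pipeline(StandardScaler(), LinearSVR(max_iter=10000))
model.fit(X_known, y_known)
predictions = model.predict(X_unknown)
```

Scaling does not itself bound the output (LinearSVR is still an unbounded linear model), but it usually removes the wildly out-of-range behavior seen with unscaled inputs.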
This has been on my radar for a long time now — it’s the first and oldest issue in the repo. We also have a pull request that will partially fix this (by supporting different feature table types, some of which are scaled types), but that is blocked by forthcoming developments in the framework.
I do not have an ETA on when I can tackle this; you are actually the first user to report issues with it, and I appreciate your input. If you wanted to contribute a pull request adding some scikit-learn preprocessing options, I would love the help. Otherwise, this is in the works and I will update this post when there are developments.
I would also be open to a PR allowing data to be constrained within rational limits.
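As a stopgap, out-of-range predictions can also be clipped to the known bounds after the fact. This is a sketch of that post-processing idea, not an existing QIIME 2 feature; the raw prediction values are made up for illustration.

```python
import numpy as np

# Hypothetical raw model output, including out-of-range values like those
# seen with LinearSVR on unscaled data.
raw_predictions = np.array([-12.3, 4.7, 56.1, 103.9, 88.0])

# Constrain predictions to the known range of the data (0-100 here).
bounded = np.clip(raw_predictions, 0, 100)
# All values now lie within [0, 100]; in-range predictions are unchanged.
```

Clipping treats the symptom rather than the cause, so it is best combined with proper feature scaling rather than used alone.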
Thanks, @Nicholas_Bokulich. I’m glad the question makes sense; sounds like you’ve been pondering it too.
Unfortunately, I'm not the guy you want to work on a PR; that's well outside my training. But I will do some empirical testing of the different regression modules with my data to see which ones are most forgiving of this scaling issue.
Thanks! Let me know what you find.
This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.