Negative learning curves, sample-classifier

Jeremie_Auger · July 20, 2020, 8:10pm

I have a dataset of people who have all undergone a dietary intervention. Not all participants gave both samples, so I excluded those who did not and kept only the ones for which I have both Visit1 and Visit7, leaving a decent n=54 for both groups. I was of course expecting predictions better than the 50% chance threshold.

But I get the opposite!

I have tried everything that came to mind, like trying the other algorithms and setting other parameters (e.g. seed). I ran it on the L6_table (collapsed at genus level) and got this, and I also ran it on the ASV table, with very similar results.

I attach a .zip of my working tables and (simplified) metadata + taxonomy (and the .txt table for verification) in case some people want to try for themselves.
BugReport-ML_NegativeLearning.zip (2.3 MB)

Thanks in advance,
Best regards, -JA

Nicholas_Bokulich · July 20, 2020, 9:27pm

Hi @Jeremie_Auger,

ROC curves below the diagonal happen (just do a quick internet search to get copious examples!). Basically, what this result is telling you is that the sample classifier you trained cannot differentiate visits any better than random chance. This is not a bug; it's a characteristic of the data (V1 and V7 look the same to this classifier).
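To illustrate why a below-diagonal ROC curve is not unusual when there is no signal, here is a minimal simulation sketch. It assumes a hypothetical held-out set of 10 samples per class and scores that are pure noise; none of these numbers come from the data in this thread. `roc_auc` is a small helper written for this example, not a QIIME 2 function.

```python
import random
import statistics

def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)

# Hypothetical held-out set: 10 samples per class (an assumption for this sketch).
labels = [1] * 10 + [0] * 10

# Score the same labels 1000 times with pure noise, i.e. a classifier with no signal.
aucs = [roc_auc(labels, [rng.random() for _ in labels]) for _ in range(1000)]

mean_auc = statistics.mean(aucs)
fraction_below = sum(a < 0.5 for a in aucs) / len(aucs)

print(f"mean AUC over simulations: {mean_auc:.3f}")
print(f"fraction of runs with AUC < 0.5: {fraction_below:.2f}")
print(f"lowest and highest AUC seen: {min(aucs):.2f}, {max(aucs):.2f}")
```

With a small test set, the AUC of an uninformative classifier scatters widely around 0.5, so a single run landing below the diagonal is expected fairly often.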

I hope that helps.

Jeremie_Auger · July 21, 2020, 6:36pm

Hi,

I know it means the data is not easily distinguishable on that label. But at the same time, it does soooo badly that it can't be simple luck! Like this attempt on the ASV table, yielding an accuracy ratio of 0.54:

I mean, it's got to be picking up some trend or something!
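One way to check the "can't be simple luck" intuition is an exact binomial tail probability. The sketch below assumes balanced classes (so chance accuracy is 0.5) and independent test predictions; `n_test` and `n_correct` are hypothetical placeholders, not values taken from the results in this thread.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Hypothetical numbers for illustration only: replace with your own test-set counts.
n_test = 22     # held-out samples
n_correct = 7   # samples predicted correctly

# Probability of doing this badly (or worse) by coin flipping.
p_low = binom_cdf(n_correct, n_test)

# Two-sided version: as far from chance in either direction.
p_two_sided = min(1.0, 2 * p_low)

print(f"accuracy: {n_correct / n_test:.2f}")
print(f"one-sided p (this low or lower): {p_low:.4f}")
print(f"two-sided p: {p_two_sided:.4f}")
```

If the tail probability is not small, the poor accuracy is compatible with chance on a test set of that size.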

And I am not saying the 'sample-classifier' module is bad; I actually love it. And appreciate the recent heatmap.qzv addition, thanks a lot guys (and gals) at qiime!

Maybe I can invert the predictor or something?
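For a two-class problem, swapping every predicted label turns an accuracy of a into 1 - a. A minimal sketch with made-up labels (the class names `V1` and `V7` and all values below are illustrative, not the data from this thread):

```python
def accuracy(truth, predicted):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def invert(predicted, classes=("V1", "V7")):
    """Swap the two class labels in a list of predictions."""
    first, second = classes
    return [second if p == first else first for p in predicted]

# Made-up example: a predictor that is wrong on 6 of 8 samples.
truth     = ["V1", "V1", "V1", "V7", "V7", "V7", "V1", "V7"]
predicted = ["V7", "V7", "V1", "V1", "V1", "V7", "V7", "V1"]

acc = accuracy(truth, predicted)
acc_flipped = accuracy(truth, invert(predicted))

print(f"original accuracy: {acc:.2f}")
print(f"inverted accuracy: {acc_flipped:.2f}")
```

The caveat is that the decision to flip is made after looking at the test results, so the flipped accuracy is no longer an unbiased estimate. It would only be convincing if the inverted predictor also did well on samples it has never seen.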

Thanks,
Best regards, -Jérémie

Nicholas_Bokulich · July 21, 2020, 6:48pm

Those newer results look much better; note the range on the color scale compared with the first results you sent.

I see it's the same classes from the first post, though. Is your concern that there is variation in accuracy each time you run it? If that's the case, you might want to try classify-samples-ncv, which will use cross-validation to predict each sample, as well as reporting the variation in accuracy across folds (unfortunately that is only printed to stdout; it is not recorded in the provenance or elsewhere).
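The idea of predicting every sample by cross-validation and looking at the spread of accuracy across folds can be sketched as follows. This is not what classify-samples-ncv does internally: it is a toy 5-fold loop with a hand-written nearest-centroid classifier on a single made-up noise feature, and the labels `V1`/`V7`, the sample count, and all helper names are assumptions for illustration.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle sample indices and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Mean feature value per class."""
    centroids = {}
    for label in sorted(set(ys)):
        values = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids

def predict(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def cross_val_accuracies(xs, ys, k=5, seed=0):
    """Per-fold accuracy; each sample is predicted exactly once."""
    folds = k_fold_indices(len(xs), k, random.Random(seed))
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit_centroids(train_x, train_y)
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies

# Made-up data: 54 samples per class, one feature that carries no class signal.
data_rng = random.Random(42)
ys = ["V1"] * 54 + ["V7"] * 54
xs = [data_rng.gauss(0, 1) for _ in ys]

accs = cross_val_accuracies(xs, ys, k=5)
print("per-fold accuracy:", [round(a, 2) for a in accs])
print(f"mean: {statistics.mean(accs):.2f}, SD across folds: {statistics.stdev(accs):.2f}")
```

Reporting the mean together with the fold-to-fold spread makes it easier to judge whether a single low (or high) accuracy is just run-to-run variation.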

Jeremie_Auger · July 22, 2020, 2:43pm

Hi,

I tried:

(qiime2-2019.10) jeremieauger@x86_64-apple-darwin13 classify-samples-ncv % qiime sample-classifier classify-samples-ncv \
  --i-table L6_table.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column "Treatment" \
  --output-dir ncv

Saved SampleData[ClassifierPredictions] to: ncv/predictions.qza
Saved FeatureData[Importance] to: ncv/feature_importance.qza
Saved SampleData[Probabilities] to: ncv/probabilities.qza

And it seems like it worked. I also attach a zip of the files it produced. I really don't know what to make of those.

ncv.zip (141.8 KB)

I unzipped the .qza files, but I can't make much sense of them.
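Since a .qza file is a zip archive whose payload sits under a `<uuid>/data/` directory, the contents can also be listed and read without unzipping by hand. The sketch below uses only the Python standard library. To stay self-contained it builds a tiny stand-in archive in memory; the file name `predictions.tsv`, the directory name `00000000-demo`, and the table contents are made up for the demo, and the real layout inside ncv/predictions.qza may differ.

```python
import io
import zipfile

def list_qza_data_files(qza):
    """Return the archive paths found under the '<uuid>/data/' directory."""
    with zipfile.ZipFile(qza) as archive:
        return [
            name for name in archive.namelist()
            if "/data/" in name and not name.endswith("/")
        ]

def read_qza_member(qza, member):
    """Read one file from the archive as text."""
    with zipfile.ZipFile(qza) as archive:
        return archive.read(member).decode("utf-8")

# Stand-in archive with made-up contents, so the example runs anywhere.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("00000000-demo/metadata.yaml", "type: demo\n")
    archive.writestr(
        "00000000-demo/data/predictions.tsv",
        "SampleID\tprediction\nS1\tV1\nS2\tV7\n",
    )

members = list_qza_data_files(demo)
text = read_qza_member(demo, members[0])

print(members)
print(text)
```

With the real output, the same two functions can be called with a path such as `"ncv/predictions.qza"` in place of `demo`. Comparing the predicted label for each sample with the true label in the metadata file then gives the overall accuracy.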

Thanks, -JA

Nicholas_Bokulich · July 22, 2020, 2:51pm

I recommend reading this tutorial for more details:
https://docs.qiime2.org/2020.6/tutorials/sample-classifier/#nested-cross-validation-provides-predictions-for-all-samples

Good luck!
