Negative learning curves, sample-classifier

I have a dataset of people who have all undergone a dietary intervention. Not all participants gave both samples, so I excluded those who did not and kept only the ones for whom I have both Visit1 and Visit7, leaving a decent n=54 for both groups. Of course, I was expecting predictions better than the 50% chance threshold.

But I get the opposite!

I have tried everything that came to mind, like trying the other algorithms and setting other parameters (e.g. seed, etc.). I ran it on the L6_table (collapsed at genus level) and got this, and I also ran it on the ASV table, with very similar results.

I attach a .zip of my working tables and (simplified) metadata + taxonomy (and the .txt table for verification…) in case some people want to try for themselves. (2.3 MB)

Thanks in advance,
Best regards, -JA

Hi @Jeremie_Auger,

ROC curves below the diagonal happen (just do a quick internet search to get copious examples!)… basically what this result is telling you is that the sample classifier you trained cannot differentiate visits any better than random chance. This is not a bug — it’s a characteristic of the data (V1 and V7 look the same to this classifier).
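To put "no better than random chance" into numbers, here is a minimal Python sketch using the exact binomial distribution. It assumes a hypothetical held-out test set of 22 samples and a classifier that is right with probability 0.5 on each sample; neither number comes from this thread, so substitute your own.

```python
from math import comb

def prob_at_most(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p): chance of k or fewer correct calls."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n_test = 22  # hypothetical test-set size (assumption, not taken from the thread)

# How often does a pure coin-flip classifier score this low or lower?
for correct in (11, 9, 7, 5):
    accuracy = correct / n_test
    print(f"<= {correct}/{n_test} correct (accuracy <= {accuracy:.2f}): "
          f"P = {prob_at_most(correct, n_test):.3f}")
```

With a small test set, a chance-level classifier lands below 50% accuracy in a sizeable share of runs, so a single below-diagonal result is not by itself evidence of a hidden signal.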

I hope that helps.


I know it means the data is not easily distinguishable on that label. But at the same time, it does soooo badly that it can’t be simple luck! Like this attempt on the ASV table, yielding an accuracy ratio of 0.54:

I mean, it’s got to be picking up some trend or something!

And I am not saying the ‘sample-classifier’ module is bad; I actually love it. And appreciate the recent heatmap.qzv addition, thanks a lot guys (and gals) at qiime!

Maybe I can invert the predictor or something?
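As a rough illustration of what inverting a two-class predictor does, here is a small Python sketch. The class names "A" and "B" and both label lists are made up for the example. Flipping every prediction turns an accuracy of a into 1 - a on the same samples, but the choice to flip was made after looking at those very samples, so the flipped score only means something if it holds up on data the classifier has never seen.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def invert(y_pred, classes):
    """Swap the two class labels in a list of binary predictions."""
    first, second = classes
    return [second if p == first else first for p in y_pred]

# Hypothetical labels and predictions, for illustration only.
classes = ("A", "B")
y_true = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
y_pred = ["B", "B", "B", "A", "B", "A", "A", "B", "A", "B"]

original = accuracy(y_true, y_pred)
flipped = accuracy(y_true, invert(y_pred, classes))
print(f"original accuracy: {original:.2f}")
print(f"inverted accuracy: {flipped:.2f}")
```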

Best regards, -Jérémie

Those newer results look much better; note the range on the color scale vs. the first results you sent.

I see it’s the same classes from the first post, though — is your concern that there is variation in accuracy each time you run it? If that’s the case, you might want to try classify-samples-ncv, which will use cross-validation to predict each sample, as well as reporting the variation in accuracy across folds (unfortunately that just prints to stdout, it is not reported in the provenance or elsewhere).
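To show what "variation in accuracy across folds" looks like, here is a minimal Python sketch of plain k-fold cross-validation. It is not the nested scheme that classify-samples-ncv runs, the labels are hypothetical, and the "model" is a stand-in that guesses at random, so it only illustrates how per-fold scores and their spread are computed.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(labels, fit, k=5, seed=0):
    """Return per-fold accuracies; every sample is predicted exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit(train_idx)  # train on the other k-1 folds
        hits = sum(model(i) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores

guess_rng = random.Random(1)  # fixed seed so the sketch is repeatable

def fit_coin_flip(train_idx):
    """Stand-in training step: ignores the training data and guesses at random."""
    return lambda i: guess_rng.choice(("A", "B"))

labels = ["A"] * 54 + ["B"] * 54  # hypothetical balanced labels
scores = cross_validate(labels, fit_coin_flip)
print("per-fold accuracy:", [round(s, 2) for s in scores])
print(f"mean = {mean(scores):.2f}, sd = {stdev(scores):.2f}")
```

Even for a model with no signal at all, individual folds scatter around 0.5, which is why the spread across folds is worth reading alongside the mean.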


I tried:

(qiime2-2019.10) % qiime sample-classifier classify-samples-ncv \
  --i-table L6_table.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column "Treatment" \
  --output-dir ncv

Saved SampleData[ClassifierPredictions] to: ncv/predictions.qza
Saved FeatureData[Importance] to: ncv/feature_importance.qza
Saved SampleData[Probabilities] to: ncv/probabilities.qza

And it seems like it worked. I also attach a zip of the files it produced. (141.8 KB)

I unzipped the .qza files, but I really can’t make much sense of them.
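For anyone else poking inside these files: a .qza is a zip archive, and the actual payload sits in a data/ folder under a top-level UUID directory, next to metadata and provenance records. Here is a minimal Python sketch that lists those payload files and previews one of them. The path ncv/predictions.qza matches the output directory above; the function names are made up for this example, and the file names inside data/ are whatever the archive contains.

```python
import os
import zipfile

def list_qza_data(path):
    """Return archive members stored under '<uuid>/data/'."""
    with zipfile.ZipFile(path) as archive:
        members = []
        for name in archive.namelist():
            parts = name.split("/")
            if len(parts) > 2 and parts[1] == "data" and not name.endswith("/"):
                members.append(name)
        return members

def preview(path, member, n_lines=5):
    """Return the first n_lines of a text member without unpacking the archive."""
    with zipfile.ZipFile(path) as archive:
        with archive.open(member) as handle:
            text = handle.read().decode("utf-8", errors="replace")
    return text.splitlines()[:n_lines]

qza = "ncv/predictions.qza"  # path taken from the command output above
if os.path.exists(qza):
    for member in list_qza_data(qza):
        print(member)
        for line in preview(qza, member):
            print("   ", line)
```

The `qiime tools export` command unpacks the same data/ contents without any scripting.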

Thanks, -JA

I recommend reading this tutorial for more details:

Good luck!