I am currently playing with sample-classifier and trying to predict categorical metadata values with it.
This pipeline takes a frequency table with absolute counts as input.
My question is how (or whether) counts are normalized before the classifier is trained. Does it make sense to use relative frequencies or another normalized output instead?
Hi @timanix ,
No normalization is done by this action. The idea is to eventually implement separate normalization actions that could then pass normalized tables to this function or elsewhere. However, it is not so simple
Not necessarily. It depends on the properties of the classifier. Many normalization methods for compositional data were designed for differential abundance tests, and their appropriate application to supervised classification problems is still an open question (see https://doi.org/10.1093/gigascience/giz107).
Knights et al. (https://academic.oup.com/femsre/article/35/2/343/661201) recommend rarefying prior to classification to avoid introducing library size biases, so this is one option (as a rarefied table is still
Thank you for the answer!
I will rarefy the tables, since the only variable I can currently predict with high confidence is the sequencing run.
Meanwhile, the factor of interest gives me an accuracy of only 0.4 with 3 levels.
I will also try running it with DESeq2-like normalisation (rounded) to see whether it affects the model.
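For anyone following along, here is a minimal sketch of the two transformations discussed in this thread: rarefying to an even depth and converting to relative frequencies. It is plain Python for illustration only, not what sample-classifier or any QIIME 2 action does internally. The function names `rarefy` and `relative_frequency`, the toy table, and the depth of 20 are all assumptions made up for this example.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Subsample one sample's counts, without replacement, down to `depth` reads.

    Hypothetical helper for illustration; not a QIIME 2 function.
    """
    if sum(counts) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    # Expand counts into one entry per read, labelled by feature index.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    picked = Counter(rng.sample(reads, depth))
    return [picked.get(i, 0) for i in range(len(counts))]


def relative_frequency(counts):
    """Convert one sample's absolute counts to proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize a sample with zero reads")
    return [c / total for c in counts]


# Toy table (assumed data): sample ID -> counts for three features.
table = {
    "s1": [120, 30, 50],
    "s2": [10, 5, 5],
    "s3": [400, 100, 500],
}

rng = random.Random(0)  # fixed seed so the subsampling is repeatable
depth = 20              # assumed even sampling depth

# Samples below the chosen depth are dropped, as is usual when rarefying.
rarefied = {
    sid: rarefy(counts, depth, rng)
    for sid, counts in table.items()
    if sum(counts) >= depth
}
relative = {sid: relative_frequency(counts) for sid, counts in table.items()}
```

After rarefying, every retained sample has the same total (here 20), so library size can no longer act as a predictor. Relative frequencies also put samples on a common scale, but each proportion is still estimated from a different number of reads.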