Is there any documentation on what form and contents the classifier specification string should have when using the fit_classifier_sklearn plugin?
Hi @ezke - we are pinging @BenKaehler, the maintainer of q2-feature-classifier, for his help on this. Thanks for your patience!
Thanks @ezke for the interest.
No, there is no documentation. Here is a quick how-to.
The `classifier-specification` should be a serialised `sklearn.pipeline.Pipeline` object. The pipeline should accept iterables of strings as the `X` and `y` inputs to its `fit` function. The `X` strings will be DNA sequences and the `y` strings will be the corresponding taxonomy descriptions.
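To make that contract concrete, here is a toy sketch of a pipeline that satisfies it. The sequences, taxonomy strings, and parameters below are made up purely for illustration; the point is the `fit(X, y)` interface that the plugin relies on:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data: X is an iterable of DNA sequence strings,
# y the matching taxonomy description strings.
X = ['ACGGTGTCAAGCTGGTCAAGACGT', 'TTGACGGCGGCCCGGACTCCAGGT']
y = ['k__Bacteria; p__Proteobacteria', 'k__Bacteria; p__Firmicutes']

pipeline = Pipeline([
    ('feat_ext', HashingVectorizer(analyzer='char_wb', ngram_range=(8, 8),
                                   alternate_sign=False)),
    ('classify', MultinomialNB()),
])
pipeline.fit(X, y)                              # strings in, model fit
pipeline.predict(['ACGGTGTCAAGCTGGTCAAGACGT'])  # taxonomy strings out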
How do you serialise a pipeline that you might like to use? The easiest way is to create it in Python, then use the helper function in the `q2_feature_classifier` API to serialise it. For instance, the following code creates a serialised version of the classifier that is provided by `fit-classifier-naive-bayes`. The classifier-specification ends up in the final variable, `classifier_specification`, which you can then edit if you wish.
In [1]: from q2_feature_classifier.classifier import spec_from_pipeline
In [2]: from sklearn.feature_extraction.text import HashingVectorizer
In [3]: from sklearn.naive_bayes import MultinomialNB
In [4]: from sklearn.pipeline import Pipeline
In [5]: from json import dumps
In [6]: steps = [('feat_ext',
   ...:           HashingVectorizer(analyzer='char_wb', n_features=8192,
   ...:                             ngram_range=[8, 8], alternate_sign=False)),
   ...:          ('classify',
   ...:           MultinomialNB(alpha=0.01, fit_prior=False))]
In [7]: pipeline = Pipeline(steps=steps)
In [8]: spec = spec_from_pipeline(pipeline)
In [9]: spec
Out[9]:
[['feat_ext',
  {'__type__': 'feature_extraction.text.HashingVectorizer',
   'alternate_sign': False,
   'analyzer': 'char_wb',
   'binary': False,
   'decode_error': 'strict',
   'encoding': 'utf-8',
   'input': 'content',
   'lowercase': True,
   'n_features': 8192,
   'ngram_range': [8, 8],
   'non_negative': False,
   'norm': 'l2',
   'preprocessor': None,
   'stop_words': None,
   'strip_accents': None,
   'token_pattern': '(?u)\\b\\w\\w+\\b',
   'tokenizer': None}],
 ['classify',
  {'__type__': 'naive_bayes.MultinomialNB',
   'alpha': 0.01,
   'class_prior': None,
   'fit_prior': False}]]
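Note that `spec` at this point is just nested Python lists and dicts, so this is where you would edit it if you want something other than the defaults. For example (hypothetical tweaks, purely to illustrate, not recommended values):

spec[0][1]['n_features'] = 16384   # hypothetical: widen the hashing space
spec[1][1]['alpha'] = 0.001        # hypothetical: weaker smoothing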
In [10]: classifier_specification = dumps(spec, indent=2)
In [11]: print(classifier_specification)
[
  [
    "feat_ext",
    {
      "non_negative": false,
      "__type__": "feature_extraction.text.HashingVectorizer",
      "encoding": "utf-8",
      "ngram_range": [
        8,
        8
      ],
      "strip_accents": null,
      "alternate_sign": false,
      "n_features": 8192,
      "lowercase": true,
      "tokenizer": null,
      "decode_error": "strict",
      "stop_words": null,
      "input": "content",
      "token_pattern": "(?u)\\b\\w\\w+\\b",
      "analyzer": "char_wb",
      "binary": false,
      "norm": "l2",
      "preprocessor": null
    }
  ],
  [
    "classify",
    {
      "fit_prior": false,
      "__type__": "naive_bayes.MultinomialNB",
      "class_prior": null,
      "alpha": 0.01
    }
  ]
]
(It is not exactly the same as `fit-classifier-naive-bayes`, because that plugin uses a custom subclass of `MultinomialNB` that chunks the input so that training a SILVA classifier doesn't require > 300 GB of RAM, but that's the only difference.)
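If you want that low-memory behaviour in a hand-rolled spec, you should be able to point the classify step at the custom subclass instead. The class name below is an assumption on my part; check the `q2_feature_classifier.custom` module for the exact name before relying on it:

# Assumed class name - verify it exists in q2_feature_classifier.custom.
spec[1][1]['__type__'] = 'custom.LowMemoryMultinomialNB'
classifier_specification = dumps(spec, indent=2)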
Finally, for security reasons, `q2_feature_classifier` won't instantiate any pipeline that contains objects that aren't from `scikit-learn` or the `q2_feature_classifier.custom` module.
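For completeness, here is a sketch of how the finished string gets used. The file names are placeholders, and the parameter names are assumed from the `fit-classifier-sklearn` action's signature (run `qiime feature-classifier fit-classifier-sklearn --help` to confirm them):

from qiime2 import Artifact
from qiime2.plugins import feature_classifier

# Placeholder paths: substitute your own FeatureData[Sequence]
# and FeatureData[Taxonomy] artifacts.
ref_reads = Artifact.load('ref-seqs.qza')
ref_taxonomy = Artifact.load('ref-taxonomy.qza')

results = feature_classifier.methods.fit_classifier_sklearn(
    reference_reads=ref_reads,
    reference_taxonomy=ref_taxonomy,
    classifier_specification=classifier_specification)
results.classifier.save('classifier.qza')  # trained TaxonomicClassifier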