Fit Classifier Sklearn Specification String

Is there any documentation on what form and contents the classifier specification string should have when using the fit_classifier_sklearn plugin?

Hi @ezke - we are pinging @BenKaehler, the maintainer of q2-feature-classifier, for his help on this. Thanks for your patience!

Thanks, @ezke, for your interest.

No, there is no documentation. Here is a quick how-to.

The classifier specification should be a serialised sklearn.pipeline.Pipeline object. The pipeline must accept iterables of strings as the X and y inputs to its fit method: the X strings will be DNA sequences and the y strings will be the corresponding taxonomy descriptions.
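As a sketch of that interface, here is a pipeline with the same steps as the one built below being fit directly on such iterables. The DNA sequences and taxonomy labels are invented toy data, purely for illustration:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the real inputs: X is an iterable of DNA sequences,
# y the matching taxonomy strings (both made up for this example).
X = ["ACGTACGTACGTACGTACGT", "GGCCTTAAGGCCTTAAGGCC"]
y = ["k__Bacteria; p__Proteobacteria", "k__Bacteria; p__Firmicutes"]

pipeline = Pipeline([
    ('feat_ext', HashingVectorizer(analyzer='char_wb', n_features=8192,
                                   ngram_range=(8, 8), alternate_sign=False)),
    ('classify', MultinomialNB(alpha=0.01, fit_prior=False)),
])

# This fit(X, y) call is the interface the plugin relies on.
pipeline.fit(X, y)
print(pipeline.predict(["ACGTACGTACGTACGTACGT"]))
```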

How do you serialise a pipeline that you might like to use? The easiest way is to create it in Python and then use the helper function in the q2_feature_classifier API to serialise it. For instance, the following code creates a serialised version of the classifier that fit-classifier-naive-bayes provides. The classifier specification ends up in the variable classifier_specification, which you can then edit if you wish.

In [2]: from q2_feature_classifier.classifier import spec_from_pipeline

In [3]: from sklearn.feature_extraction.text import HashingVectorizer

In [4]: from sklearn.naive_bayes import MultinomialNB

In [5]: from sklearn.pipeline import Pipeline

In [11]: from json import dumps

In [6]: steps = [('feat_ext',
   ...:           HashingVectorizer(analyzer='char_wb', n_features=8192,
   ...:                             ngram_range=[8,8], alternate_sign=False)),
   ...:          ('classify',
   ...:           MultinomialNB(alpha=0.01, fit_prior=False))]

In [7]: pipeline = Pipeline(steps=steps)

In [8]: spec = spec_from_pipeline(pipeline)

In [10]: spec
Out[10]:
[['feat_ext',
  {'__type__': 'feature_extraction.text.HashingVectorizer',
   'alternate_sign': False,
   'analyzer': 'char_wb',
   'binary': False,
   'decode_error': 'strict',
   'encoding': 'utf-8',
   'input': 'content',
   'lowercase': True,
   'n_features': 8192,
   'ngram_range': [8, 8],
   'non_negative': False,
   'norm': 'l2',
   'preprocessor': None,
   'stop_words': None,
   'strip_accents': None,
   'token_pattern': '(?u)\\b\\w\\w+\\b',
   'tokenizer': None}],
 ['classify',
  {'__type__': 'naive_bayes.MultinomialNB',
   'alpha': 0.01,
   'class_prior': None,
   'fit_prior': False}]]

In [14]: classifier_specification = dumps(spec, indent=2)

In [15]: print(classifier_specification)
[
  [
    "feat_ext",
    {
      "__type__": "feature_extraction.text.HashingVectorizer",
      "alternate_sign": false,
      "analyzer": "char_wb",
      "binary": false,
      "decode_error": "strict",
      "encoding": "utf-8",
      "input": "content",
      "lowercase": true,
      "n_features": 8192,
      "ngram_range": [
        8,
        8
      ],
      "non_negative": false,
      "norm": "l2",
      "preprocessor": null,
      "stop_words": null,
      "strip_accents": null,
      "token_pattern": "(?u)\\b\\w\\w+\\b",
      "tokenizer": null
    }
  ],
  [
    "classify",
    {
      "__type__": "naive_bayes.MultinomialNB",
      "alpha": 0.01,
      "class_prior": null,
      "fit_prior": false
    }
  ]
]
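Because the specification is plain JSON, editing it is just a matter of loading it, tweaking a value, and dumping it again. Here is a minimal sketch using only the standard library; the abbreviated spec reuses parameter names from the output above, and alpha = 0.1 is an arbitrary example value:

```python
import json

# An abbreviated classifier specification, using the same [step_name,
# parameters] structure and parameter names as shown above.
classifier_specification = json.dumps([
    ["feat_ext", {"__type__": "feature_extraction.text.HashingVectorizer",
                  "analyzer": "char_wb", "n_features": 8192,
                  "ngram_range": [8, 8], "alternate_sign": False}],
    ["classify", {"__type__": "naive_bayes.MultinomialNB",
                  "alpha": 0.01, "fit_prior": False}],
])

# Parse, change a hyperparameter, and re-serialise.
spec = json.loads(classifier_specification)
spec[1][1]["alpha"] = 0.1  # arbitrary example: increase the smoothing
classifier_specification = json.dumps(spec, indent=2)
print(classifier_specification)
```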

(It is not exactly the same as fit-classifier-naive-bayes, because fit-classifier-naive-bayes uses a custom subclass of MultinomialNB that chunks the input so that training a SILVA classifier doesn't require more than 300 GB of RAM, but that's the only difference.)

Finally, for security reasons, q2_feature_classifier won’t instantiate any pipeline that contains objects that aren’t from scikit-learn or the q2_feature_classifier.custom module.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.
