hybrid classifier question

devonorourke · July 28, 2020, 10:07am

Hi all,

After finishing up my first attempt at using the hybrid vsearch/sklearn feature classifier I was surprised to find that every single sequence (~2,600 unique sequences) was classified using sklearn, with not a single vsearch exact match!?

After looking at the code governing the classifier actions I think my confusion rests on how I had misinterpreted what "exact" might mean. I was hoping that if the vsearch alignment of my sequence feature was 100% identical to a reference sequence it would be retained, but that's not quite right. It sounds like the reference and query must be identical not only in sequence composition, but also identical in length. For those of us using query sequences that are shorter than the reference sequences, I'm guessing this current approach isn't what is desired.
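To make the distinction concrete, here is a minimal sketch in plain Python. It is not how VSEARCH works internally: the two helper functions and both sequences are made up for illustration, and the "identity" check is simplified to an ungapped substring test.

```python
def is_exact_match(query: str, reference: str) -> bool:
    """Full-length identity: same bases AND same length."""
    return query.upper() == reference.upper()


def is_full_identity_hit(query: str, reference: str) -> bool:
    """100% identity over 100% of the query: the query occurs, ungapped,
    somewhere inside the reference (a simplification of a real alignment)."""
    return query.upper() in reference.upper()


reference = "ACGTACGTTTGACCA"  # made-up full-length reference
amplicon = "ACGTTTGA"          # made-up shorter query from inside it

print(is_exact_match(amplicon, reference))        # False
print(is_full_identity_hit(amplicon, reference))  # True
```

A short amplicon can therefore align perfectly to a longer reference and still fail the exact-match test, which would explain why nothing was assigned at the first stage.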

Instead, I was hoping to modify this hybrid classifier so that the user could either (1) use the existing, faster, exact match approach, but, on the chance that this won't work, then (2) it would be possible for a user to input the typical --p-perc-identity and --p-query-cov parameters of classify-consensus-vsearch. Obviously the latter approach will be slower, but at least it's still a hybrid classifier that lets a user work with a mixture of sequence lengths in a reference database. One benefit is that by including those two parameters, you can now not only perform exact alignments, but you can perform a hybrid classification using whatever alignment parameters you want - maybe you want 99% alignment over 98% query coverage, followed by LCA consensus, then a hybrid classifier to kick in?

There's an existing complication, however. The current hybrid approach includes an optional pre-filter step that relies on those very same --p-perc-identity and --p-query-cov parameters I'd like to incorporate. If a modification along the lines of what I was thinking were made, it would likely require either removing the pre-filter step or renaming one of the redundant terms. I'd vote for keeping the pre-filtering option, but renaming the parameters specific to the pre-filtering to --p-pre-query-cov and --p-pre-perc-identity.

Thanks @Nicholas_Bokulich for the new tool!

Nicholas_Bokulich · July 28, 2020, 1:49pm

Correct.

Correct

Sure! This method is a fairly simple pipeline under the hood, so it should be straightforward to modify... it would even be possible to modify the existing method to choose whether to use classify-sklearn or classify-consensus-vsearch at stage 2. Contributions are very welcome

I think your re-named parameters option makes sense.

Thanks @devonorourke!

devonorourke · July 28, 2020, 5:58pm

If @Nicholas_Bokulich or anyone else might be hosting office hours for those of us who are Python deficient, please let me know!

For now I'm thinking it'll be faster to just run VSEARCH first, subset those feature IDs that didn't get called by VSEARCH within my parameters, then pass those into the sklearn classifier. Glad to know I was on the right track though with respect to understanding the existing functions.

Thanks!

Nicholas_Bokulich · July 28, 2020, 6:34pm

yep, that's all the hybrid classifier is doing under the hood.

classification 1 (exact match):

https://github.com/qiime2/q2-feature-classifier/blob/2b3fa82ac982f625eac97b28dd131069bba6bab3/q2_feature_classifier/_vsearch.py#L110-L118

          # find exact matches, perform LCA consensus classification
          taxa1, = ccv(query=query, reference_reads=reference_reads,
                       reference_taxonomy=reference_taxonomy, maxaccepts=maxaccepts,
                       strand=strand, min_consensus=min_consensus,
                       search_exact=True, threads=threads, maxhits=maxhits,
                       maxrejects=maxrejects, output_no_hits=True)
          
          # Annotate taxonomic assignments with classification method
          taxa1 = _annotate_method(taxa1, 'VSEARCH')

filter unassigned:

https://github.com/qiime2/q2-feature-classifier/blob/2b3fa82ac982f625eac97b28dd131069bba6bab3/q2_feature_classifier/_vsearch.py#L120-L128

          # perform second pass classification on unassigned taxa
          # filter out unassigned seqs
          try:
              query, = filter_seqs(sequences=query, taxonomy=taxa1,
                                   include=_get_default_unassignable_label())
          except ValueError:
              # get ValueError if all sequences are filtered out.
              # so if no sequences are unassigned, return exact match results
              return taxa1

reclassify and merge:

https://github.com/qiime2/q2-feature-classifier/blob/2b3fa82ac982f625eac97b28dd131069bba6bab3/q2_feature_classifier/_vsearch.py#L130-L139

          # classify with sklearn classifier
          taxa2, = cs(reads=query, classifier=classifier,
                      reads_per_batch=reads_per_batch, n_jobs=threads,
                      confidence=confidence, read_orientation=read_orientation)
          
          # Annotate taxonomic assignments with classification method
          taxa2 = _annotate_method(taxa2, 'sklearn')
          
          # merge into one big happy result
          taxa, = merge(data=[taxa2, taxa1])

So this could be done as individual steps with QIIME 2/q2-feature-classifier, or with a bit of rewiring the pipeline could be configured to do this (only minimal Python experience would be needed, I promise it wouldn't hurt!)
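Stripped of the QIIME 2 specifics, the three steps (classify, filter unassigned, reclassify and merge) can be sketched in plain Python. Everything here is an assumption for illustration: `two_pass_classify`, the toy classifiers, and the `"Unassigned"` label are stand-ins, not the plugin's functions.

```python
from typing import Callable, Dict, Iterable

UNASSIGNED = "Unassigned"  # assumed label for features with no hit

Classifier = Callable[[Iterable[str]], Dict[str, str]]


def two_pass_classify(feature_ids, first_pass: Classifier,
                      second_pass: Classifier) -> dict:
    feature_ids = list(feature_ids)

    # Step 1: first classifier (e.g. VSEARCH with your own thresholds),
    # annotated with the method that produced each call.
    taxa1 = {fid: (tax, "VSEARCH")
             for fid, tax in first_pass(feature_ids).items()}

    # Step 2: keep only the features the first pass left unassigned.
    leftover = [fid for fid in feature_ids
                if taxa1.get(fid, (UNASSIGNED, ""))[0] == UNASSIGNED]
    if not leftover:
        return taxa1

    # Step 3: second classifier (e.g. sklearn) on the leftovers only,
    # then merge so second-pass calls replace the unassigned rows.
    taxa2 = {fid: (tax, "sklearn")
             for fid, tax in second_pass(leftover).items()}
    merged = dict(taxa1)
    merged.update(taxa2)
    return merged


def toy_vsearch(ids):
    known = {"f1": "k__Bacteria; p__Firmicutes"}  # made-up reference hit
    return {fid: known.get(fid, UNASSIGNED) for fid in ids}


def toy_sklearn(ids):
    return {fid: "k__Bacteria" for fid in ids}


result = two_pass_classify(["f1", "f2"], toy_vsearch, toy_sklearn)
print(result)
```

Passing the two classifiers in as functions keeps the sketch independent of which method runs at each stage, so swapping the second stage is a one-argument change.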

devonorourke · July 28, 2020, 6:36pm

I'm trying this now with PyCharm in hopes it'll raise a flag when I mess something up.

My first attempt is to just add in additional parameters. Would you recommend a certain naming convention to discriminate between the initial prefilter VSEARCH step and the second VSEARCH step I'm revising?

There would be redundant query coverage and percent identity terms. I was going to use --prefilter-query-cov and --prefilter-perc-identity for the initial step, then --main-query-cov and --main-perc-identity for the follow up step.

Any other preference?

I'm also trying to figure out how to add in another if else statement that works similarly to the if prefilter statement. I wanted to have an option for a user to toggle on or off the exact match state, so that either:

1. by default, the secondary alignment is an exact match, as currently implemented,
2. if the user turns this off, the program will switch to a set of default --main-perc-identity and --main-query-cov values, and run it that way, or
3. if the user doesn't turn off the exact option, but does enter (a) main-perc-identity and/or main-query-cov parameter(s), then the exact function isn't run
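The three cases above could be resolved in one small helper before the VSEARCH call. This is only a sketch: `resolve_search_settings` is a hypothetical name, and the two default values are placeholders chosen for illustration rather than values taken from the plugin.

```python
from typing import Optional

# Placeholder defaults for illustration only (assumption).
DEFAULT_PERC_IDENTITY = 0.8
DEFAULT_QUERY_COV = 0.8


def resolve_search_settings(search_exact: bool = True,
                            perc_identity: Optional[float] = None,
                            query_cov: Optional[float] = None) -> dict:
    user_set_thresholds = (perc_identity is not None
                           or query_cov is not None)

    if search_exact and not user_set_thresholds:
        # Case 1: default behaviour, exact matching only.
        return {"search_exact": True}

    # Cases 2 and 3: exact matching is off, either because the user
    # disabled it or because at least one threshold was supplied.
    # Any threshold left unset falls back to its default.
    return {
        "search_exact": False,
        "perc_identity": (DEFAULT_PERC_IDENTITY if perc_identity is None
                          else perc_identity),
        "query_cov": (DEFAULT_QUERY_COV if query_cov is None
                      else query_cov),
    }


print(resolve_search_settings())
print(resolve_search_settings(search_exact=False))
print(resolve_search_settings(perc_identity=0.99))
```

Using `None` as the "not provided" value is what lets the third case be detected; if the thresholds had numeric defaults in the signature, there would be no way to tell whether the user had set them.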

Nicholas_Bokulich · July 28, 2020, 6:40pm

just:
--p-prefilter-query-cov
--p-prefilter-perc-identity
and
--p-query-cov
--p-perc-identity

devonorourke · July 28, 2020, 7:21pm

Apologies for the silly question, but I'm curious what the appropriate steps are:

1. I've forked the repo.
2. I've modified the code.
3. ...
4. test?

I'm stuck on how to test. Would I create a new Conda/QIIME environment, then manually edit the _vsearch.py script to reflect my updates, then try it out on a sample data set? What's the smarter way to do this?

p.s. I think you can see what I've got so far here

thanks!

Nicholas_Bokulich · July 28, 2020, 8:38pm

Testing is where writing a QIIME 2 plugin/method gets a lot more advanced. q2-feature-classifier has a suite of unit tests that will automatically run when a pull request is made to the qiime2 repository, or when triggered by a local testing framework like nosetests or pytest. You can add a new test function to the existing tests for the hybrid classifier:

https://github.com/qiime2/q2-feature-classifier/blob/2b3fa82ac982f625eac97b28dd131069bba6bab3/q2_feature_classifier/tests/test_consensus_assignment.py#L98

                                                      weak_id=0.8, perc_identity=0.99,
                                                      output_no_hits=False)
                  res = result.Taxon.to_dict()
                  tax = self.taxonomy.to_dict()
                  right = 0.
                  for taxon in res:
                      right += tax[taxon].startswith(res[taxon])
                  self.assertGreater(right/len(res), 0.5)
          
          
          class HybridClassiferTests(FeatureClassifierTestPluginBase):
              package = 'q2_feature_classifier.tests'
          
              def setUp(self):
                  super().setUp()
                  taxonomy = Artifact.import_data(
                      'FeatureData[Taxonomy]', self.get_data_path('taxonomy.tsv'))
                  self.taxonomy = taxonomy.view(pd.Series)
                  self.taxartifact = taxonomy
                  # TODO: use `Artifact.import_data` here once we have a transformer
                  # for DNASequencesDirectoryFormat -> DNAFASTAFormat

This is somewhat more difficult than just shuffling around pieces in the pipeline like what you've done so far (good work by the way!) but if you're up for an afternoon of learning (a.k.a. banging your head against the wall) you will reach a new level in your Python programming abilities.

Writing a simple unit test and cleaning up your method are the requirements to contribute your changes to the source code... but all of this is probably way more than you set out to accomplish! So if you just want to make something that will work locally, I recommend just testing out manually with some test data as you have described.
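For a sense of the general shape of a unit test, here is a minimal sketch using Python's built-in `unittest`. It does not use the QIIME 2 test base class or the real pipeline: `toy_hybrid` is a hypothetical stand-in function and the sequences and taxonomy strings are made up.

```python
import unittest


def toy_hybrid(queries, reference, fallback_label="k__Bacteria"):
    """Hypothetical stand-in for the modified pipeline: exact hits keep
    the reference taxonomy, everything else gets a fallback label."""
    return {seq: reference.get(seq, fallback_label) for seq in queries}


class ToyHybridTests(unittest.TestCase):
    def setUp(self):
        # Shared fixture, rebuilt before every test method.
        self.reference = {"ACGT": "k__Bacteria; p__Firmicutes"}

    def test_exact_hit_keeps_reference_taxonomy(self):
        res = toy_hybrid(["ACGT"], self.reference)
        self.assertEqual(res["ACGT"], "k__Bacteria; p__Firmicutes")

    def test_miss_falls_back(self):
        res = toy_hybrid(["TTTT"], self.reference)
        self.assertEqual(res["TTTT"], "k__Bacteria")
```

Saved in a file whose name starts with `test_`, this can be run with `python -m unittest` or with pytest. A real test for the plugin would follow the same pattern: build small inputs in `setUp`, call the method, and assert on the result.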
