hybrid classifier question

Hi all,

After finishing my first attempt at using the hybrid vsearch/sklearn feature classifier, I was surprised to find that every single sequence was classified using sklearn (~2,600 unique sequences) - not a single vsearch exact match!?

After looking at the code governing the classifier actions I think my confusion rests on how I had misinterpreted what “exact” might mean. I was hoping that if the vsearch alignment of my sequence feature was 100% identical to a reference sequence it would be retained, but that’s not quite right. It sounds like the reference and query must be identical not only in sequence composition, but also identical in length. For those of us using query sequences that are shorter than the reference sequences, I’m guessing this current approach isn’t what is desired.

Instead, I was hoping to modify this hybrid classifier so that the user could either (1) use the existing, faster, exact match approach, but, on the chance that this won’t work, then (2) it would be possible for a user to input the typical --p-perc-identity and --p-query-cov parameters of classify-consensus-vsearch. Obviously the latter approach will be slower, but at least it’s still a hybrid classifier that lets a user work with a mixture of sequence lengths in a reference database. One benefit is that by including those two parameters, you can now not only perform exact alignments, but you can perform a hybrid classification using whatever alignment parameters you want - maybe you want 99% alignment over 98% query coverage, followed by LCA consensus, then a hybrid classifier to kick in?

There’s an existing complication, however. The current hybrid approach includes an optional pre-filter step that relies on those very same --p-perc-identity and --p-query-cov parameters I’d like to incorporate. If a modification along the lines of what I was thinking were made, it would likely require either removing the pre-filter step or renaming one of the redundant terms. I’d vote for keeping the pre-filtering option and renaming the parameters specific to pre-filtering to --p-pre-query-cov and --p-pre-perc-identity.

Thanks @Nicholas_Bokulich for the new tool!

Correct.

Correct

Sure! This method is a fairly simple pipeline under the hood, so it should be straightforward to modify… it would even be possible to modify the existing method to choose whether to use classify-sklearn or classify-consensus-vsearch at stage 2. Contributions are very welcome :wink:

I think your re-named parameters option makes sense.

Thanks @devonorourke!

If @Nicholas_Bokulich or anyone else might be hosting office hours for those of us who are Python deficient, please let me know :man_facepalming:t3:!

For now I’m thinking it’ll be faster to just run VSEARCH first, subset those feature IDs that didn’t get called by VSEARCH within my parameters, then pass those into the sklearn classifier. Glad to know I was on the right track, though, with respect to understanding the existing functions.

Thanks!

yep, that’s all the hybrid classifier is doing under the hood.

classification 1 (exact match):

filter unassigned:

reclassify and merge:

So this could be done as individual steps with QIIME 2/q2-feature-classifier, or with a bit of rewiring the pipeline could be configured to do this (only minimal python experience would be needed, I promise it wouldn’t hurt! :snake: )
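
For anyone following along, here is a rough sketch of those three steps in plain Python. It is not the q2-feature-classifier source: every function name here is made up for illustration, sequences are plain strings in dicts, and the stage-2 classifier is a placeholder that just returns a fixed label. It does show why a query that is a shorter substring of a reference falls through the exact-match stage.

```python
# Toy illustration only: none of these names come from q2-feature-classifier.
UNASSIGNED = "Unassigned"


def exact_match_classify(queries, reference, taxonomy):
    """Stage 1: label a query only when a reference sequence is identical to it
    (same bases AND same length). If two references share a sequence, the
    later one wins in this toy version."""
    seq_to_taxon = {seq: taxonomy[ref_id] for ref_id, seq in reference.items()}
    return {qid: seq_to_taxon.get(seq, UNASSIGNED) for qid, seq in queries.items()}


def filter_unassigned(queries, classification):
    """Stage 2: keep only the queries that stage 1 could not label."""
    return {qid: seq for qid, seq in queries.items()
            if classification[qid] == UNASSIGNED}


def placeholder_fallback(queries):
    """Stand-in for the second classifier (sklearn in the hybrid method)."""
    return {qid: "fallback-call" for qid in queries}


def hybrid_classify(queries, reference, taxonomy, fallback=placeholder_fallback):
    """Stage 3: reclassify the leftovers and merge with the stage-1 calls."""
    first_pass = exact_match_classify(queries, reference, taxonomy)
    leftovers = filter_unassigned(queries, first_pass)
    second_pass = fallback(leftovers)
    # Second-pass calls overwrite the "Unassigned" placeholders from stage 1.
    return {**first_pass, **second_pass}


reference = {"ref1": "ACGTACGTAC", "ref2": "TTGGCCAATT"}
taxonomy = {"ref1": "g__Alpha", "ref2": "g__Beta"}
queries = {
    "q1": "ACGTACGTAC",  # identical to ref1
    "q2": "ACGTACGT",    # 100% identical over its length, but shorter than ref1
}
result = hybrid_classify(queries, reference, taxonomy)
```

Here `q1` keeps its exact-match label while `q2` is handed to the fallback, which mirrors the behaviour described at the top of the thread. Swapping a different function in for `fallback` is the "rewiring" idea in miniature.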

I’m trying this now with PyCharm in hopes it’ll raise a flag when I mess something up.

My first attempt is to just add in additional parameters. Would you recommend a certain naming convention to discriminate between the initial prefilter VSEARCH step and the second VSEARCH step I’m revising?

There would be redundant query coverage and percent identity terms. I was going to use --prefilter-query-cov and --prefilter-perc-identity for the initial step, then --main-query-cov and --main-perc-identity for the follow up step.

Any other preference?

I’m also trying to figure out how to add in another if else statement that works similarly to the if prefilter statement. I wanted to have an option for a user to toggle on or off the exact match state, so that either:

  1. by default, the secondary alignment is an exact match, as currently implemented,
  2. if the user turns this off, the program will switch to a set of default --main-perc-identity and --main-query-cov values, and run it that way, or
  3. if the user doesn’t turn off the exact option, but does enter (a) main-perc-identity and/or main-query-cov parameter(s), then the exact function isn’t run
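
One way the three cases above could be expressed, as a sketch only: the function name, the returned dict, and the 0.8 fallback defaults are all placeholders chosen for illustration, and `perc_identity` / `query_cov` stand in for the proposed "main" options. Using `None` as the "not supplied" value is what lets case 3 be told apart from case 1.

```python
def resolve_search_mode(exact=True, perc_identity=None, query_cov=None,
                        default_perc_identity=0.8, default_query_cov=0.8):
    """Decide how the VSEARCH stage should run (hypothetical helper).

    None means "the user did not supply this value"; the 0.8 defaults are
    placeholders, not values taken from the plugin.
    """
    # Case 3: any explicit alignment parameter overrides the exact-match mode.
    if perc_identity is not None or query_cov is not None:
        return {
            "exact": False,
            "perc_identity": default_perc_identity if perc_identity is None else perc_identity,
            "query_cov": default_query_cov if query_cov is None else query_cov,
        }
    # Case 1: nothing supplied and exact matching left on.
    if exact:
        return {"exact": True, "perc_identity": None, "query_cov": None}
    # Case 2: exact matching switched off, so fall back to the defaults.
    return {
        "exact": False,
        "perc_identity": default_perc_identity,
        "query_cov": default_query_cov,
    }


case_default = resolve_search_mode()
case_exact_off = resolve_search_mode(exact=False)
case_params_given = resolve_search_mode(perc_identity=0.99, query_cov=0.98)
```

Whether case 3 should silently override the exact flag or raise an error for the conflicting combination is a design choice; the sketch takes the silent-override route described in the list.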

just:
--p-prefilter-query-cov
--p-prefilter-perc-identity
and
--p-query-cov
--p-perc-identity

Apologies for the silly question, but I’m curious what the appropriate steps are:

  1. I’ve forked the repo.
  2. I’ve modified the code.
  3. test?

I’m stuck on how to test. Would I create a new Conda/QIIME environment, then manually edit the _vsearch.py script to reflect my updates, then try it out on a sample data set? What’s the smarter way to do this?

p.s. I think you can see what I’ve got so far here

thanks!

Testing is where writing a QIIME 2 plugin/method gets a lot more advanced. q2-feature-classifier has a suite of unit tests that will automatically run when a pull request is made to the qiime2 repository, or when triggered by a local testing framework like nosetests or pytest. You can add a new test function to the existing tests for the hybrid classifier:
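
To give a feel for the shape of a unit test, here is a minimal, self-contained sketch using Python's standard `unittest` module. It is not one of the plugin's tests: `is_exact_match` is a hypothetical helper defined right in the example, and a real test for the hybrid classifier would exercise the actual pipeline with small test data instead.

```python
import unittest


def is_exact_match(query, reference):
    """Hypothetical helper: True only if the two sequences are identical,
    which implies identical length as well."""
    return query == reference


class ExactMatchTests(unittest.TestCase):
    def test_identical_sequences_match(self):
        self.assertTrue(is_exact_match("ACGTACGT", "ACGTACGT"))

    def test_shorter_query_does_not_match(self):
        # The query is a perfect prefix of the reference but is shorter,
        # so it must not count as an exact match.
        self.assertFalse(is_exact_match("ACGTAC", "ACGTACGT"))


# Run the two tests programmatically and keep the result object around.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExactMatchTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Saved in a file whose name starts with `test_`, a class like this would also be picked up by pytest without the last two lines.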

This is somewhat more difficult than just shuffling around pieces in the pipeline like what you’ve done so far (good work, by the way!), but if you’re up for an afternoon of learning (a.k.a. banging your head against the wall) you will reach a new level in your Python programming abilities.

Writing a simple unit test and cleaning up your method are the requirements to contribute your changes to the source code… but all of this is probably way more than you set out to accomplish! So if you just want to make something that will work locally, I recommend just testing out manually with some test data as you have described.