Match identity in classify-sklearn

Barbs · June 7, 2018, 3:11pm

Hello,

I am working with a pre-fitted sklearn-based taxonomy classifier. I am wondering how exactly taxon identities are assigned to the sequences. In other classifiers there is an option to reject the match if percent identity to query is lower than a given value [0.0, 1.0].

similar to
--p-perc-identity FLOAT in the BLAST+ consensus taxonomy classifier

Or is there a fixed default value?

Thanks a lot! Barbs

colinbrislawn · June 7, 2018, 6:18pm

Hello Barbs,

The sklearn-based taxonomy classifier does not perform classification in the same way as 'search then LCA' classifiers like vsearch and blast.

You can read more about the sklearn machine learning method on line 126 of this preprint, or in the original sklearn-paper.

I'm still looking for a quick and easy explanation of how the sklearn classifier works. One of the QIIME devs should point us to that!

Colin

antgonza · June 7, 2018, 6:50pm

I guess the question here is, how does Naive Bayes work, right?

Hope this helps.

Barbs · June 8, 2018, 8:55am

Thanks @colinbrislawn & @antgonza!

Turns out it always helps to read something about the underlying concept before using a method! Not that I understand the sklearn-based taxonomy classifier yet, but I get that it's not alignment-based.

Thanks a lot! Barbs

colinbrislawn · June 8, 2018, 5:55pm

Hello Antonio,

I was hoping for a one-sentence summary.

Blockchain: a public ledger with an incentives problem
Selection sort: keep picking the smallest value until you have all the values
classify-consensus-blast: find the top database hits and take their common levels
classify-consensus-vsearch: find the top database hits and take their common levels
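To make "take their common levels" concrete, here is a minimal sketch in Python. The function name `common_levels` and the toy taxonomy strings are made up for illustration, and the sketch assumes strict unanimity among the hits; as far as I know the real plugins use a consensus threshold rather than requiring every hit to agree.

```python
def common_levels(hit_taxonomies):
    """Return the leading taxonomic levels shared by every hit.

    Hypothetical helper for illustration only; not the QIIME 2 implementation.
    """
    if not hit_taxonomies:
        # No database hits at all: nothing to build a consensus from.
        return ["Unassigned"]
    split = [taxonomy.split(";") for taxonomy in hit_taxonomies]
    consensus = []
    # Walk the ranks from kingdom downwards and stop at the first disagreement.
    for levels in zip(*split):
        if len(set(levels)) != 1:
            break
        consensus.append(levels[0])
    return consensus or ["Unassigned"]


# Toy top hits for one query sequence (made-up data).
hits = [
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales",
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales",
    "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales",
]
print(";".join(common_levels(hits)))  # k__Bacteria;p__Firmicutes
```

The empty-hits branch is the "what happens if your database doesn't have any hits" case.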

I want to acknowledge that these summaries are only one sentence long because we have domain knowledge. We don't specify how our ledger is public or how we find top hits in our database.

Still, these short summaries are the hook, so that we can start exploring the details and understand the implications. Like, you can instantly see that the blast and vsearch methods are really similar, and might wonder what happens if your database doesn't have any hits. Now that's a pretty good way to start a conversation!

Can someone write a one-sentence summary for classify-sklearn?
@mortonjt @yoshiki @wasade @BenKaehler

Colin

BenKaehler · June 9, 2018, 2:15am

How about

classify-sklearn: classify sequences by k-mer abundance using a scikit-learn classifier

There are, of course, more details.
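As a rough illustration of that one-sentence summary, the sketch below counts k-mers in toy reference sequences and fits a scikit-learn naive Bayes classifier on them. Everything specific here is an assumption for the example: the sequences, the taxon labels, the 4-mer length, the `classify` helper, and the simple cut-off on the top class probability. It is not the plugin's actual pipeline, which is more involved. On the original question about a threshold: as far as I know, classify-sklearn has a `--p-confidence` parameter (default 0.7) instead of a percent-identity cut-off.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reference database (made-up sequences and taxa).
reference_seqs = ["ACGTACGTACGTACGTACGT", "TTGGCCAATTGGCCAATTGG"]
reference_taxa = ["k__Bacteria;p__A", "k__Bacteria;p__B"]

# Step 1: turn each sequence into overlapping 4-mer counts.
# Step 2: fit a multinomial naive Bayes model on those counts.
classifier = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(4, 4)),
    MultinomialNB(),
)
classifier.fit(reference_seqs, reference_taxa)


def classify(seq, confidence=0.7):
    """Return (label, probability); 'Unassigned' if below the cut-off.

    Simplified stand-in for a confidence threshold, for illustration only.
    """
    probs = classifier.predict_proba([seq])[0]
    best = probs.argmax()
    classes = classifier.steps[-1][1].classes_
    if probs[best] < confidence:
        return "Unassigned", float(probs[best])
    return str(classes[best]), float(probs[best])


# Shares most of its 4-mers with the first reference.
print(classify("ACGTACGTACGTACGA")[0])  # k__Bacteria;p__A
# Shares no 4-mers with either reference, so only the class priors remain.
print(classify("GGGGGGGGGGGGGGGG")[0])  # Unassigned
```

Note that no alignment or percent identity is computed anywhere: the decision rests on which taxon makes the observed k-mers most probable.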

I’ll have a chat to @Nicholas_Bokulich about writing something more substantial.

colinbrislawn · June 9, 2018, 2:33am

Thanks Ben!

That's pretty good! How does classify-sklearn's use of k-mers compare to the Wang k-mer classifier implemented in RDP and also in Mothur? Is it the same algorithm, just implemented using scikit-learn?

This has been really helpful!
Colin

BenKaehler · June 9, 2018, 11:09pm

No worries.

My understanding of the Mothur implementation is that it is a reproduction of the RDP algorithm. Anyone, please correct me if that is wrong.

fit-classifier-naive-bayes creates a naive Bayes classifier, and RDP uses a naive Bayes classifier, so they are fundamentally the same thing.
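For reference, the generic naive Bayes decision rule that both build on can be written as follows. This is the textbook multinomial form, not a transcription of either tool: $t$ ranges over the reference taxa, $k_i$ is the $i$-th distinct k-mer in the query, and $c_i$ is how often it occurs there.

$$
\hat{t} = \arg\max_{t} \; P(t) \prod_{i} P(k_i \mid t)^{c_i}
$$

The "naive" part is the assumption that k-mers are independent given the taxon, which is what lets the likelihood factor into a product. Choices such as the k-mer length, how $P(k_i \mid t)$ is smoothed, what prior $P(t)$ is used, and how a confidence value is attached to $\hat{t}$ are the kind of implementation details where tools can differ.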

Implementation details differ, and we should definitely go into that somewhere. I'm chatting with @Nicholas_Bokulich about the appropriate channel.

The benchmark paper discusses what is important from a user perspective.
