I am working with a pre-fitted sklearn-based taxonomy classifier. I am wondering how exactly taxon identities are assigned to the sequences. In other classifiers there is an option to reject the match if percent identity to query is lower than a given value [0.0, 1.0].
This would be similar to
--p-perc-identity FLOAT in the BLAST+ consensus taxonomy classifier.
Turns out it always helps to read something about the underlying concept before using a method! Not that I understand the sklearn-based taxonomy classifier yet, but I get that it's not alignment-based.
Blockchain: a public ledger with an incentives problem
Selection sort: keep picking the smallest value until you have all the values
classify-consensus-blast: find the top database hits and take their common levels
classify-consensus-vsearch: find the top database hits and take their common levels
I want to acknowledge that these summaries are only one sentence long because we have domain knowledge. We don't specify how our ledger is public or how we find top hits in our database.
Still, these short summaries are the hook, so that we can start exploring the details and understand the implications. Like, you can instantly see that the blast and vsearch methods are really similar, and might wonder what happens if your database doesn't have any hits. Now that's a pretty good way to start a conversation!
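To make the "take their common levels" part of that summary concrete, here is a minimal Python sketch. It is not the plugin's actual code: the function name `common_levels`, the semicolon-separated taxonomy strings, and the "Unassigned" fallback for an empty hit list are all assumptions made for illustration. It also requires every hit to agree at a level, whereas the real classifiers may use a softer majority rule.

```python
from typing import List


def common_levels(hit_taxonomies: List[str], sep: str = ";") -> str:
    """Return the leading taxonomy levels shared by every top hit.

    Hypothetical helper for illustration only; it assumes each hit is a
    string such as "k__Bacteria; p__Firmicutes; c__Bacilli".
    """
    if not hit_taxonomies:
        # Assumption: no database hits means the query stays unassigned.
        return "Unassigned"

    # Split each taxonomy string into a list of trimmed level names.
    split = [[level.strip() for level in t.split(sep)] for t in hit_taxonomies]

    shared = []
    # zip stops at the shortest taxonomy, so ragged depths are handled.
    for levels in zip(*split):
        if all(level == levels[0] for level in levels):
            shared.append(levels[0])
        else:
            break  # first disagreement ends the common prefix

    return "; ".join(shared) if shared else "Unassigned"


if __name__ == "__main__":
    hits = [
        "k__Bacteria; p__Firmicutes; c__Bacilli",
        "k__Bacteria; p__Firmicutes; c__Clostridia",
    ]
    print(common_levels(hits))
```

With the two example hits, the shared prefix stops at the phylum level because the classes disagree. How the top hits are found in the first place (percent identity, coverage, number of hits kept) is exactly the detail the one-sentence summary leaves out.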
That's pretty good! How does classify-sklearn's use of k-mers compare to the Wang k-mer classifier implemented in RDP and also in Mothur? Is it the same algorithm, just implemented using scikit-learn?