Understanding taxonomic assignment

colinbrislawn · March 24, 2020, 7:54pm

Hello David,

Great questions!

Let's start with the taxonomy.

Correct! The ASV matches to this specific entry in SILVA down to the species level, but SILVA does not yet list a species for this entry.

Probably. Not sure how SILVA makes this decision.

That's right. There's a big bucket of D_5__unknowns, which is why it can be helpful to summarize these results at the D_4__ Family level.
(If another sequence has the same ASV id, a9016c5734d00d83a3741982ceb49c44, it should have the exact same DNA sequence.)
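To make the family-level summary concrete, here is a minimal Python sketch of rolling counts up to the D_4__ rank. The ASV names, taxonomy strings, and counts are made-up examples in SILVA's `D_n__` style, and `truncate_taxonomy` is a hypothetical helper, not a QIIME 2 function.

```python
from collections import Counter

def truncate_taxonomy(taxon, level):
    """Keep the first `level` ranks of a semicolon-delimited taxonomy string."""
    ranks = [r.strip() for r in taxon.split(";")]
    return ";".join(ranks[:level])

# Hypothetical assignments and read counts, for illustration only.
prefix = "D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales"
assignments = {
    "asv1": prefix + ";D_4__Lachnospiraceae;D_5__uncultured",
    "asv2": prefix + ";D_4__Lachnospiraceae;D_5__Blautia",
    "asv3": prefix + ";D_4__Ruminococcaceae;D_5__uncultured",
}
counts = {"asv1": 10, "asv2": 5, "asv3": 7}

# D_0__ through D_4__ is five ranks, so level=5 stops at Family.
family_totals = Counter()
for asv, taxon in assignments.items():
    family_totals[truncate_taxonomy(taxon, 5)] += counts[asv]

for family, total in family_totals.items():
    print(family.split(";")[-1], total)
```

The two "uncultured" ASVs no longer share a bucket; each is counted under its own family. In QIIME 2 itself, `qiime taxa collapse` with `--p-level` does this kind of roll-up on a feature table.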

Now let's talk about the magic of a Naive Bayes k-mer Classifier.

Naive Bayes is an old-school (i.e. 1960s) supervised machine learning classification method.
k-mers are all of the substrings of length k in a sequence. The k-mer composition works like a summary of the sequence, and similar sequences will have similar k-mer compositions.
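Here is a small Python sketch of what that means. The sequences and the choice of k = 3 are just toy values so the output is easy to read, and `count_kmers` is a name I made up for illustration.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

a = count_kmers("ACGTACGT", 3)
b = count_kmers("ACGTACGA", 3)  # same sequence with the last base changed

print(a)  # ACG and CGT appear twice, GTA and TAC once
# Counter's & operator keeps the minimum count of each shared k-mer.
shared = sum((a & b).values())
print(shared, "of", sum(a.values()), "k-mers shared")  # 5 of 6
```

One changed base only disturbs the k-mers that overlap it, so the two compositions stay mostly the same.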

This method takes each sequence, counts its k-mers, and then does the Naive Bayes thing to classify it, based on the k-mers and taxonomy from the database on which it was trained.

Which is really close, but not quite what you had described.

It compares its k-mer composition to the k-mers in the database.
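If it helps, here is a toy version of "the Naive Bayes thing" in plain Python. This is a sketch of the general idea (multinomial Naive Bayes with add-one smoothing and a uniform prior), not the actual QIIME 2 code. The reference sequences, the `Family_A`/`Family_B` labels, k = 3, and the function names are all invented for the example.

```python
import math
from collections import Counter, defaultdict

def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train(references, k):
    """Pool k-mer counts per taxon from (sequence, taxon) pairs."""
    counts = defaultdict(Counter)
    for seq, taxon in references:
        counts[taxon].update(count_kmers(seq, k))
    vocab = {kmer for c in counts.values() for kmer in c}
    return counts, vocab

def classify(query, counts, vocab, k):
    """Return the best-scoring taxon and all log-likelihood scores."""
    query_kmers = count_kmers(query, k)
    scores = {}
    for taxon, c in counts.items():
        # Add-one smoothing; the extra +1 leaves room for k-mers
        # that never appeared in the training data.
        denom = sum(c.values()) + len(vocab) + 1
        scores[taxon] = sum(
            n * math.log((c[kmer] + 1) / denom)
            for kmer, n in query_kmers.items()
        )
    return max(scores, key=scores.get), scores

# Hypothetical two-entry "database", for illustration only.
references = [
    ("ACGTACGTACGT", "Family_A"),
    ("TTGGCCAATTGG", "Family_B"),
]
counts, vocab = train(references, k=3)
best, scores = classify("ACGTACGA", counts, vocab, k=3)
print(best)  # Family_A
```

Notice that the query is never aligned to a reference. It is scored only on how likely its k-mers are under each taxon, which is the difference from an alignment-based approach.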

What you described is pretty much what classify-consensus-vsearch does, if you want to try that for comparison!

Colin