Understanding taxonomic assignment

David_Bradshaw · March 24, 2020, 6:41pm

Dear QIIME2 Community,

I wanted to double-check that I understand how a sequence is assigned a taxonomic identity. I used the following command:

"qiime feature-classifier classify-sklearn --i-classifier silva-132-99-515-806-nb-classifier.qza --i-reads example.qza --o-classification taxonomy --p-reads-per-batch 10000 --verbose"

Does this take each representative sequence and compare it to the SILVA database at each taxonomic level until it is at least 70% sure it can assign a Kingdom, then a Phylum, and so on? Or does it just compare the sequence to the database until it finds the reference sequence it most closely matches? If it is the latter, how does the 70% confidence threshold come into play?

If a sequence is identified as the following:
a9016c5734d00d83a3741982ceb49c44: D_0__Bacteria; D_1__Bacteroidetes; D_2__Bacteroidia; D_3__Flavobacteriales; D_4__Flavobacteriaceae; D_5__unknown; D_6__unknown

Does that mean that whoever assigned that taxonomic id to that reference sequence was unable to determine what genus it was?

If it were labeled uncultured instead, would that mean the sequence was similar enough to other Flavobacteriaceae sequences to assign it to that family, but not similar enough to any one genus to assign it at the genus level?

If another sequence has that same taxonomic id, that does not necessarily mean that it is a different genus, correct? Just that it could not be assigned to a reference sequence with a taxonomic id at the genus level?

Thank you for the time and help. Sorry for all the questions, just trying to understand it and a literature review was not helpful.

Sincerely,

David

colinbrislawn · March 24, 2020, 7:54pm

Hello David,

Great questions!

Let's start with the taxonomy.

On whether D_5__unknown means the genus could not be determined for that reference sequence: correct! The ASV matches this specific entry in SILVA down to the species level, but SILVA does not yet list a genus or species for this entry.

On uncultured: probably. I'm not sure how SILVA makes this decision.

And on two sequences sharing that taxonomic id: that's right. There's a big bucket of D_5__unknowns, which is why it can be helpful to summarize these results at the D_4__ (family) level.
(If another sequence has the same ASV id, a9016c5734d00d83a3741982ceb49c44, it should have the exact same DNA sequence.)
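
To make the "summarize at the family level" idea concrete, here is a minimal sketch in plain Python. It is not QIIME 2 code: the `collapse` helper and the second ASV id are made up for illustration, and only the first taxonomy string comes from this thread. It just truncates each taxonomy string after the D_4__ rank and counts what lands in each bucket.

```python
from collections import Counter

def collapse(taxonomy, n_ranks):
    """Keep only the first n_ranks ranks of a semicolon-delimited taxonomy string."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    return "; ".join(ranks[:n_ranks])

assignments = {
    # Taxonomy string quoted in the question above.
    "a9016c5734d00d83a3741982ceb49c44": (
        "D_0__Bacteria; D_1__Bacteroidetes; D_2__Bacteroidia; "
        "D_3__Flavobacteriales; D_4__Flavobacteriaceae; D_5__unknown; D_6__unknown"
    ),
    # Hypothetical second ASV, invented for this example.
    "hypothetical_asv_2": (
        "D_0__Bacteria; D_1__Bacteroidetes; D_2__Bacteroidia; "
        "D_3__Flavobacteriales; D_4__Flavobacteriaceae; D_5__Flavobacterium; D_6__unknown"
    ),
}

# D_0__ through D_4__ is five ranks, so keep five.
family_counts = Counter(collapse(tax, 5) for tax in assignments.values())
for family, n in family_counts.items():
    print(n, family)
```

Both ASVs end up in the same Flavobacteriaceae bucket even though only one of them has a named genus. For real data, QIIME 2 has a `qiime taxa collapse` command that does this on a feature table.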

Now let's talk about the magic of a Naive Bayes k-mer Classifier.

Naive Bayes is an old-school (i.e. 1960s) supervised machine learning classification method.
k-mers are all of the overlapping substrings of length k in a sequence. They act like a summary of the sequence, and similar sequences will have similar k-mer compositions.
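
If it helps to see k-mers spelled out, here is a small sketch in plain Python. The `count_kmers` name, the toy sequences, and k=4 are all just choices for this example, not what the QIIME 2 classifier uses internally.

```python
from collections import Counter

def count_kmers(seq, k=4):
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Two toy sequences that differ only in the last base.
a = count_kmers("ACGTACGTAC")
b = count_kmers("ACGTACGTAA")

# Counter intersection keeps the minimum count of each shared k-mer.
shared = sum((a & b).values())
print(dict(a))
print(dict(b))
print(shared, "of", sum(a.values()), "k-mers shared")
```

A single base change only disturbs the k-mers that overlap it, so the two profiles stay very similar. That is the sense in which similar sequences have similar k-mer compositions.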

This method takes each sequence, counts its k-mers, and then does the Naive Bayes thing to classify it, based on the k-mers and taxonomy in the database on which it was trained.

So it's really close to, but not quite, what you described.

It compares the sequence's k-mer composition to the k-mers in the database.
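
Here is a toy version of "the Naive Bayes thing" in plain Python, in case a concrete example is useful. Treat it as a sketch under stated assumptions, not the actual classify-sklearn implementation: the function names, the two reference sequences, k=4, the add-one smoothing, and the uniform priors are all made up for illustration. The 0.7 cutoff mirrors the 70% from the question, but this sketch applies it once to a single label and does not model how QIIME 2 handles confidence at each taxonomic rank.

```python
import math
from collections import Counter, defaultdict

def count_kmers(seq, k=4):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train(references, k=4):
    """Pool k-mer counts per taxon from (sequence, taxon) pairs."""
    model = defaultdict(Counter)
    for seq, taxon in references:
        model[taxon].update(count_kmers(seq, k))
    return model

def classify(seq, model, k=4, alpha=1.0):
    """Return (best_taxon, posterior) for a multinomial Naive Bayes
    with uniform priors and additive smoothing alpha."""
    vocab = set().union(*model.values())
    query = count_kmers(seq, k)
    log_scores = {}
    for taxon, counts in model.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_scores[taxon] = sum(
            n * math.log((counts[kmer] + alpha) / total)
            for kmer, n in query.items()
        )
    # Turn log-likelihoods into posteriors; subtract the max for stability.
    top = max(log_scores.values())
    weights = {t: math.exp(s - top) for t, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def assign(seq, model, threshold=0.7):
    """Keep the best label only if its posterior reaches the threshold."""
    best, confidence = classify(seq, model)
    return best if confidence >= threshold else "Unassigned"

# Toy reference "database": invented sequences, illustrative labels.
model = train([
    ("ACGTACGTACGTACGT", "Flavobacteriaceae"),
    ("TTGGCCAATTGGCCAA", "Rhizobiaceae"),
])

print(assign("ACGTACGTACGA", model))  # k-mers resemble the first reference
print(assign("GGGGGGGG", model))      # k-mers seen in neither reference
```

The first query shares nearly all of its k-mers with one reference, so the posterior for that taxon is far above the cutoff. The second query has no k-mers in common with either reference, the posterior splits evenly, and the label is withheld. Note that nothing here aligns the query to a reference sequence; only k-mer counts are compared.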

What you described is pretty much what classify-consensus-vsearch does, if you want to try that for comparison!

Colin

David_Bradshaw · March 24, 2020, 9:03pm

Dear Colin,

Thank you very much for the in-depth explanation. I understand it now. I greatly appreciate it!!

Sincerely,

David