How do Qiime2 naive Bayes classifiers handle query sequences shorter than reference sequences used for training?

I apologize if this does not belong here (and for the long post), but I have a question about techniques used in Qiime2 naive Bayes classifiers (using q2cli version 2022.2.0).

My understanding is that, during training, classifiers are not necessarily "aware" of reference sequence length. Instead, counts of kmers extracted from reference sequences are normalized using L2 normalization, and that is how the model accounts for varying sequence lengths. Once trained, does the classifier treat kmer counts of query sequences similarly?
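To make concrete what I mean, here is a minimal toy sketch of kmer counting plus L2 normalization (my own illustration, not the actual q2-feature-classifier code; I'm assuming the default kmer length of 7):

```python
import math
from collections import Counter

def kmer_counts(seq, k=7):
    # Count overlapping kmers (k=7 is, as I understand it, the QIIME2 default)
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def l2_normalize(counts):
    # Scale so the kmer-count vector has unit Euclidean norm;
    # this removes the effect of total sequence length
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {kmer: c / norm for kmer, c in counts.items()}

profile = l2_normalize(kmer_counts("ACGTTGCATGTCAGTACGGTACTAGCAT"))
```

A short and a long sequence both end up as unit-norm vectors, so raw length drops out of the comparison.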

Say I've trained my classifier on a custom mitochondrial 16S database with reference sequences ~130 bp long. I assume the classifier does not pad shorter query sequences (~80 bp) with placeholder characters before kmer extraction. Instead, do query kmer counts go through some normalization to account for length variation? In other words, how does (or can) the classifier penalize the confidence of assignments for short sequences due to missing information?

I work at a research facility where we frequently need to assign taxonomy to metabarcoding reads generated from generally low-quality, non-invasive samples (scat, swabs, bat guano; hey @devonorourke, big fan!), so reads that are short relative to the target amplicon are just a problem we have to deal with. I've been excited to introduce custom databases and NBC classifiers into our lab pipeline, but I'm struggling a bit to explain why things may break down with the shortest sequences in our data. One thing I'm wondering, and what may be happening, is that the kmers extracted from my shortest reads mostly cover conserved regions with little distinguishing variation, that the classifier doesn't penalize confidence (enough?) for the missing information, and that it therefore outputs overclassified assignments with high confidence. One example: my classifier assigned an 88 bp read to Molossidae (a bat family) with 0.82 confidence, while BLAST assigned it to Canis lupus (96% identity, 100% coverage, and far more likely, as the sample was field-tech-collected swift fox scat), and I have plenty of Canis lupus sequences in my reference database. That said, I think it's important to note that the classifier does seem to perform very well overall!

Anyway, any answers to my questions or any insights on how I may better apply the classifier would be so greatly appreciated! I'm also hoping to soon investigate if the vsearch/classifier hybrid better suits my specific data/needs. Thanks in advance.


Hi @elmorejoanna ,

Welcome to the forum!

This is a great question. The NB classifiers are generally quite robust to length variation. Sequence length is not accounted for at all during classifier training or during classification, and no penalty is applied. Instead, the closest match is found from kmer frequencies using the NB algorithm. So as long as the reference and query sequences are long enough that their kmer distributions are still representative, classification performs well even with variation in sequence length or in the start/end positions of the sequences. One clear example: you can use full-length reference sequences to classify short variable-domain amplicons (e.g., classify V4 domain sequences with a full-length 16S classifier).
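The intuition that a fragment still carries a representative kmer signal can be shown with a toy multinomial NB over kmer counts (pure-Python sketch with made-up random "taxa"; not the q2-feature-classifier implementation, which wraps scikit-learn):

```python
import math
import random
from collections import Counter

random.seed(0)

def rand_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

def kmers(seq, k=7):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Two hypothetical reference taxa, 130 bp each (as in the question)
refs = {"taxon_A": rand_seq(130), "taxon_B": rand_seq(130)}
models = {taxon: kmers(seq) for taxon, seq in refs.items()}
vocab = set().union(*models.values())

def log_likelihood(query_counts, model, alpha=1.0):
    # Multinomial NB log-likelihood with Laplace smoothing
    total = sum(model.values()) + alpha * len(vocab)
    return sum(c * math.log((model[k] + alpha) / total)
               for k, c in query_counts.items())

# Query: an 80 bp fragment cut from taxon_A's reference sequence
query = refs["taxon_A"][25:105]
pred = max(models, key=lambda t: log_likelihood(kmers(query), models[t]))
# pred == "taxon_A": the fragment's kmers still point to the right taxon
```

The fragment shares no 7-mers with taxon_B, so the likelihood comparison resolves correctly despite the ~40% length difference.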

But accuracy does begin to degrade when the query or reference sequences are very short and the kmer distribution becomes skewed/non-representative (e.g., short enough that the sequence covered is highly conserved and hence the kmer distribution is no longer indicative of a given species).

How short is too short? This will depend on the target gene and how variable it is, so it is tough to say off-hand. But you could simulate sequences of different lengths and test how accurately the NB classifier classifies them at each length (and for this you could use the RESCRIPt plugin to evaluate accuracy at each level).
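A minimal sketch of that simulation idea (hypothetical helper and toy reference sequences; you would feed the fragments to your trained classifier and score accuracy per length):

```python
import random

random.seed(1)

def simulate_fragments(ref_seqs, lengths=(130, 110, 90, 70), n_per_ref=10):
    # For each trial length, cut random fragments from each reference
    # sequence; classify these with your trained classifier and measure
    # how often the known taxon is recovered at each length.
    for length in lengths:
        for taxon, seq in ref_seqs.items():
            for _ in range(n_per_ref):
                if len(seq) < length:
                    continue
                start = random.randrange(len(seq) - length + 1)
                yield length, taxon, seq[start:start + length]

# Toy stand-ins for real reference sequences
refs = {"taxon_A": "ACGT" * 33, "taxon_B": "TTGCA" * 26}
frags = list(simulate_fragments(refs))
```

Plotting accuracy against fragment length should show where classification starts to degrade for your particular gene.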

The Molossidae assignment is surprising: even if the kmer profile of the short read hits Molossidae with high confidence, it should presumably also hit Canis sequences, in which case I would expect the taxonomy to resolve to some LCA between them (e.g., Mammalia). If the query or reference sequences are in incorrect or mixed orientations, that would explain the misclassification. So would some of your reference sequences being abnormally short or misannotated (issues that unfortunately occur to some degree in many reference databases).

Don't bother; it will not. The hybrid classifier uses VSEARCH in exact-match mode, which looks for exact end-to-end matches between the query and reference, with zero tolerance for length variation. So this will not work in your case.

But what you could do:

  1. Try adjusting the confidence parameter to see how it performs with your target gene. The default settings were optimized for bacterial 16S and fungal ITS sequences, so mito-16S might benefit from some re-optimization.
  2. Classify with multiple methods (e.g., NB, the vsearch-based consensus classifier, different parameter settings), then use RESCRIPt to create a consensus taxonomy from these based on LCA, most-frequent, or another consensus method.
  3. Check your reference seqs. Is it possible that you have a Canis sequence mistakenly annotated as Molossidae? Or in the incorrect orientation?
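The LCA consensus idea can be sketched in a few lines (toy code with made-up taxonomy strings, not the RESCRIPt implementation):

```python
def lca(taxonomies):
    # Longest shared prefix of rank-delimited taxonomy strings:
    # the idea behind an LCA consensus across classifiers
    split = [t.split(";") for t in taxonomies]
    consensus = []
    for ranks in zip(*split):
        if len(set(ranks)) == 1:
            consensus.append(ranks[0])
        else:
            break
    return ";".join(consensus) if consensus else "Unassigned"

# Hypothetical conflicting calls from two classifiers
calls = [
    "Mammalia;Carnivora;Canidae;Canis lupus",
    "Mammalia;Chiroptera;Molossidae;Molossus",
]
print(lca(calls))  # → "Mammalia"
```

Conflicting assignments collapse to the deepest rank on which the methods agree, rather than letting one overconfident call win outright.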

Good luck!


A recent video exploring how RDP works helps visualize the Naive Bayes algorithm. I think it can help give a bit more context to the terrific response by Nick. See: https://youtu.be/VbXkK_nsmu4?si=CXLPWb3XUT7OGCvR

