How does the sklearn classifier handle ambiguous bases?

Hello,

The title captures most of the question. How would the classifier interact with, say, a query sequence with a number of Ns padded within it or at its beginning? Will the Ns match any of the database entries, be ignored, or will the classifier look for Ns in the training set as well? If someone can suggest a good read on the algorithms behind the blast, vsearch, and sklearn tools, that would be great too. Right now, I am working on a dataset with concatenated reads (i.e., paired-end amplicon reads with no overlap), and I'm trying to make sense of how each tool is affected by that. So far, vsearch has had the fewest unassigned reads and the best resolution. Any input is appreciated, thanks!


Hi @RielAlfonso,
As far as I recall, it doesn't treat them specially: it handles ambiguous bases literally, for the reasons below (though I would need to dive into the source code to confirm).

It is generally best practice to remove sequences with stretches of Ns prior to classification, because these will increase the uncertainty and error rate (an N is generally a miscalled base, except where intentionally inserted, as in your case). So most upstream QC and denoising steps are designed to eliminate reads with ambiguous bases. Likewise, most steps in reference database preparation should also remove sequences with many (or any) Ns (if you have significant errors in a sequence, can you really rely on it as a reference?), especially stretches of Ns, which are a bad sign.

So if I recall correctly, we designed the sklearn classifier action on the assumption that most ambiguous bases would be removed upstream. If there is an odd N or two in a sequence (users often tolerate a low number), this should not impact classification too much, as it will just lead to a low level of pollution in the kmer profile: the kmers that overlap the N will simply find no match, and the neighboring kmers that do not overlap it are unaffected. On the other hand, long stretches of Ns (which should almost never occur in a typical workflow) could lead to significant misclassification if any reference sequences also contain stretches of Ns (these are usually low quality and poorly annotated as a result!).
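To make the "pollution in the kmer profile" point concrete, here is a minimal sketch that counts how many overlapping kmers in a read contain an N. It is plain Python and not the classifier's actual feature extraction; the helper names (`kmers`, `count_polluted`), the toy read, and the choice of k = 7 are assumptions made for illustration only.

```python
def kmers(seq, k=7):
    """Return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_polluted(seq, k=7):
    """Return (k-mers containing at least one N, total k-mers)."""
    all_kmers = kmers(seq, k)
    polluted = sum(1 for kmer in all_kmers if "N" in kmer)
    return polluted, len(all_kmers)


# Toy 100-base "read" (assumption: real amplicons are not this repetitive).
read = "ACGT" * 25

# One isolated N (e.g. a single miscalled base) at position 50.
single_n = read[:50] + "N" + read[51:]

# A 10-base N spacer replacing positions 45-54.
spacer = read[:45] + "N" * 10 + read[55:]

for name, seq in [("clean", read), ("single N", single_n), ("N spacer", spacer)]:
    polluted, total = count_polluted(seq)
    print(f"{name}: {polluted} of {total} 7-mers contain an N")
```

In this toy case the single N touches 7 of the 94 kmers and the 10-base spacer touches 16 of 94: a run of n Ns in the interior of a read overlaps n + k - 1 kmers, so the damage grows with the length of the run.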

But your use case is a bit of a special one:

Inserting Ns as a spacer in a sequence will work with a global aligner like VSEARCH. But it will not work with classify-sklearn, and we generally discourage this practice, since it can be misleading and raise other issues (with phylogeny, etc.). See this topic (and the linked topic) for more discussion on this point:

As vsearch is the only global aligner in the bunch, that makes sense! This is why vsearch supports a concatenation-with-N-spacer option. QIIME 2 so far does not, as this would misbehave with the other tools available in QIIME 2. Theoretically, a type like FeatureData[SequencewithSpacer] could be created to support using vsearch for merging, then clustering, then classification, while avoiding passing these seqs to other tools where N-spacers will cause chaos.
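For readers unfamiliar with what an N-spacer concatenation produces, here is a rough sketch. It is not vsearch's implementation: the function names, the spacer length, and the decision to reverse-complement the reverse read before joining are all assumptions for illustration.

```python
# Complement table for unambiguous bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def join_with_spacer(forward, reverse, spacer_len=8):
    """Join a non-overlapping read pair with a run of Ns in between.

    spacer_len is a hypothetical parameter, not a documented default.
    """
    return forward.upper() + "N" * spacer_len + reverse_complement(reverse)


joined = join_with_spacer("ACGTAC", "TTGACC", spacer_len=4)
print(joined)  # ACGTACNNNNGGTCAA
```

The joined sequence carries a run of Ns that a global aligner can span, but that any kmer-based or phylogenetic tool will read as literal sequence content, which is the reason for the caution above.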
