Hi, I have successfully run Sidle on my own primer sets. However, the results show that many sequences were assigned taxonomy as 'genus A | genus B', i.e., 'sequences differ at some higher level', as you mention in the sequence reconstruction tutorial. We noticed that the original SMURF/5R algorithm does not generate such results. These assignments are confusing because we sequenced pure E. coli, yet some sequences were assigned 'g__E.C. | g__Klebsiella'.
On page 13 of the supplementary document for 5R (Deborah Nejman et al., 2020), it says that 'for groups comprising more than one 16S rRNA sequence, taxonomy is assigned by a majority vote'. How can I achieve this using Sidle or Python scripts? And do you have any suggestions for handling these mismatches among regions (should I discard those sequences or assign them a single taxonomy)? Thank you very much!
You're running into 3 issues here, only one of which is Sidle-related.
First, E. coli doesn't have a unique 16S sequence. There just isn't sufficient resolution to distinguish E. coli from a lot of other things in family Enterobacteriaceae. This isn't an algorithmic issue or a database issue; it's a biological one. So, even if you know it's pure E. coli, it's going to be hard to get a 16S classifier to determine that unless you modify your database to exclude anything that looks like E. coli but isn't. I personally wouldn't recommend this for biological sequences.
Second, there is currently no majority vote functionality in Sidle; you'd be welcome to open a pull request to handle this and work from there.
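If you want to experiment with a majority vote outside of Sidle, here is a minimal Python sketch. It is not part of Sidle and may not match the exact rule used in 5R. It assumes you have already collected the database lineage strings for the reference sequences in one group (the function name, the semicolon-delimited input format, and the `min_fraction` threshold are all my own assumptions for illustration). It walks down the ranks and stops at the last rank where a strict majority of the group still agrees.

```python
from collections import Counter


def majority_vote_taxonomy(lineages, delimiter=";", min_fraction=0.5):
    """Return the deepest lineage shared by more than `min_fraction` of the inputs.

    `lineages` is a list of strings such as
    "k__Bacteria; f__Enterobacteriaceae; g__Escherichia" (hypothetical format).
    """
    # Split each lineage into a list of rank labels
    split = [[rank.strip() for rank in lin.split(delimiter)] for lin in lineages]
    total = len(split)
    consensus = ()
    depth = 1
    while True:
        # Count every lineage prefix of the current depth
        prefixes = Counter(tuple(lin[:depth]) for lin in split if len(lin) >= depth)
        if not prefixes:
            break
        prefix, count = prefixes.most_common(1)[0]
        # Stop as soon as no prefix has a strict majority of the whole group
        if count / total <= min_fraction:
            break
        consensus = prefix
        depth += 1
    return (delimiter + " ").join(consensus)


group = [
    "k__Bacteria; f__Enterobacteriaceae; g__Escherichia",
    "k__Bacteria; f__Enterobacteriaceae; g__Escherichia",
    "k__Bacteria; f__Enterobacteriaceae; g__Klebsiella",
]
print(majority_vote_taxonomy(group))
```

Note that an even split (one Escherichia, one Klebsiella) has no majority, so this sketch falls back to the family level rather than picking a genus arbitrarily.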
Third, my recommendation is to take a deep breath, exhale, and get comfortable with ambiguity. I recognize that sounds flippant, but the annotation reflects the algorithm, database, and settings. If you discard sequences that couldn't be resolved across your regions, then you're throwing out potentially useful information, especially in places where taxonomic names and 16S sequences don't line up, like Enterobacteriaceae.
Microbiome taxonomy is all kinds of messy and biased. It's in almost constant flux. You're presenting a hypothesis that reflects a set of processing steps and choices that you (and others) made. So, just let this be one more layer of things not being consistent.