Trying to assign OTU ID to each raw sequence

Hello again Louise,

Cool! So it sounds like the goal is the downstream biology, instead of algorithm development.

With that in mind, I think it makes sense to focus on an elegant way to integrate these 40 studies, without getting sidetracked by differences between hypervariable regions.

Sure. De novo (Latin for 'anew') OTU clustering makes new OTUs based on the reads provided. Closed-ref OTU clustering does not make new OTUs at all; it simply counts matches to existing OTUs provided in a database.
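If it helps to see the difference concretely, here is a toy Python sketch of the two approaches. To be clear, this is not how vsearch works internally: the `identity` function, the 10 bp sequences, the OTU names, and the 0.9 threshold are all made up for illustration, and real tools use proper alignment.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (equal-length sequences only)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, reference, threshold):
    """Count reads per existing reference OTU. No new OTUs are ever created."""
    counts = {otu_id: 0 for otu_id in reference}
    unmatched = 0
    for read in reads:
        # Find the best-scoring reference sequence for this read.
        best_id, best_score = None, 0.0
        for otu_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = otu_id, score
        if best_id is not None and best_score >= threshold:
            counts[best_id] += 1
        else:
            unmatched += 1  # this read is dropped from downstream analysis
    return counts, unmatched


def de_novo(reads, threshold):
    """Greedy clustering: a read joins the first centroid it matches, else founds a new OTU."""
    centroids, counts = [], []
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                counts[i] += 1
                break
        else:
            centroids.append(read)
            counts.append(1)
    return centroids, counts


# Hypothetical data: two reference OTUs and five reads.
reference = {"OTU_A": "ACGTACGTAC", "OTU_B": "TTTTGGGGCC"}
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "GGGGGGGGGG", "GGGGGGGGGG"]

closed_counts, unmatched = closed_reference(reads, reference, threshold=0.9)
centroids, denovo_counts = de_novo(reads, threshold=0.9)

print(closed_counts, unmatched)
print(centroids, denovo_counts)
```

In this toy data, the two `GGGGGGGGGG` reads match nothing in the reference, so closed-ref drops them, while de novo builds a third OTU from them.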

Take a look at this thread. Their Ion Torrent data also spans different regions, so they are dealing with the same underlying problem as you. They also consider using closed-ref OTUs for these reasons:

This makes sense. You can only get OTUs that are in that specific database, so this bias is huge. Hopefully alpha and beta diversity are similar :stuck_out_tongue_winking_eye:

oh thank goodness! :sweat_smile:

I'm not sure about the HOMD database, but it makes sense that a special-use database would have fewer taxa than a general-use database like SILVA. Any OTUs not in the database will be totally missing from all downstream analysis, but that's the downside of this strong regional normalization :man_shrugging:


Now that I know more about your study, I want to go back to the beginning:
I think you chose the right method to move forward quickly: closed-ref + maybe some ASV analysis of specific taxa.

Do you want to craft that vsearch command? :hammer_and_wrench:

Do you have any other theory questions about closed-ref OTUs? :mag: :memo:

Colin