Training classifier, clustering sequences and picking representative sequence

Mehrbod_Estaki · April 27, 2019, 10:04pm

Hi @DoHarris,
Welcome aboard!

While there is no validated correct answer here, I would say most people today lean towards the higher percentages for higher resolution. I posted a few links on this notion that discuss both sides of the debate, if you are interested. I would personally use the 99% database.

At the moment, DADA2 and Deblur are the 2 plugins available that produce ASVs, but neither uses clustering methods. These denoising methods produce something different from OTUs. See the paper linked below.

I think you may benefit from reading this paper with regard to sequence variants. ASVs differ from OTUs in that they are exact sequence variants, so essentially a 100% similarity threshold. There is no clustering, which means even a single-nucleotide difference will lead to the formation of a new ASV.
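To make the threshold difference concrete, here is a toy sketch in plain Python. It is not how DADA2 or Deblur work internally (they denoise reads using error models). It only contrasts grouping by exact sequence with a simple greedy 97% clustering. The function names and the example reads are made up for illustration, and identity is computed position by position on equal-length reads, which is a simplification.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, pre-aligned reads)."""
    if len(a) != len(b):
        raise ValueError("this toy example only compares equal-length reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def exact_variants(reads):
    """Group reads by exact sequence, i.e. a 100% similarity threshold."""
    return Counter(reads)


def greedy_otus(reads, threshold=0.97):
    """Toy greedy clustering: the most abundant sequences seed clusters first."""
    clusters = {}  # centroid sequence -> total read count
    for seq, count in Counter(reads).most_common():
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += count
                break
        else:
            # No existing centroid is similar enough, so start a new cluster.
            clusters[seq] = count
    return clusters


# Hypothetical data: a 100 nt read and a variant with a single substitution.
base = "ACGT" * 25
variant = base[:50] + "A" + base[51:]  # position 50 was "G"
reads = [base] * 5 + [variant] * 3

asvs = exact_variants(reads)
otus = greedy_otus(reads, threshold=0.97)

print(len(asvs))  # 2: the single-nucleotide difference gives a separate variant
print(len(otus))  # 1: at 97% the two sequences collapse into one cluster
```

With exact matching, the two sequences stay separate, while at 97% they merge and the more abundant one ends up standing in for both.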

As there is no clustering being performed, there is no special method for picking representative sequences, as there used to be with OTUs. The representative sequence is simply the true sequence representing that group.

You are correct, the default setting is de novo. See here and here for a bit more discussion about this, if you are interested.