Training classifier, clustering sequences and picking representative sequence

DoHarris · April 27, 2019, 5:54pm

Dear QIIME Users Community,

I recently shifted from QIIME1 to QIIME2. Pardon my ignorance, but I would like to get some clarifications:

What is the ideal Greengenes taxonomy to use for analysis of microbiota data, e.g., vaginal microbiota data? Is it Greengenes 99% OTU sequences with 99% OTU taxonomy or Greengenes 97% OTU sequences with 97% OTU taxonomy?
What method is used to cluster sequences into ASVs in QIIME2 using the DADA2 plugin?
What is the similarity threshold used to cluster sequences into ASVs?
What method is used to pick the representative sequences in QIIME2 using the DADA2 plugin?
Lastly, am I right to assume that the default chimera-checking method (consensus) in DADA2 is de novo?

Mehrbod_Estaki · April 27, 2019, 10:04pm

Hi @DoHarris,
Welcome aboard!

While there is no validated correct answer here, I would say most people today lean towards the higher percentage for higher resolution. I have posted a few links on this question that discuss both sides of the debate, if you are interested. I would personally use the 99% database.

At the moment, DADA2 and Deblur are the two plugins available that produce ASVs, but neither uses a clustering method. These denoising methods are different from OTU clustering. See the paper linked below.

I think you may benefit from reading this paper on sequence variants. ASVs differ from OTUs in that they are exact sequence variants, so the similarity threshold is essentially 100%. There is no clustering, which means even a single-nucleotide difference leads to a new ASV.
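To make the threshold point concrete, here is a minimal Python sketch. It is not how DADA2 or Deblur work internally (both denoise reads with an error model, which is not reproduced here); it only contrasts exact-sequence grouping with a toy 97% greedy clustering. The function names and the example reads are made up for illustration.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def exact_variants(reads):
    """Group reads by exact sequence: any difference gives a separate variant."""
    return Counter(reads)


def greedy_otus(reads, threshold=0.97):
    """Toy greedy clustering (hypothetical, not a QIIME 2 method).

    Unique sequences are visited from most to least abundant; each one joins
    the first centroid it matches at >= threshold, otherwise it becomes a new
    centroid.
    """
    centroids = {}
    for seq, count in Counter(reads).most_common():
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                centroids[centroid] += count
                break
        else:
            centroids[seq] = count
    return centroids


# Two made-up 100 nt sequences that differ at a single position (99% identical).
base = "ACGT" * 25
variant = "T" + base[1:]
reads = [base] * 5 + [variant] * 3

print(len(exact_variants(reads)))  # 2: the single-nucleotide variant stays separate
print(len(greedy_otus(reads)))     # 1: at 97% both sequences fall into one OTU
```

In the clustered case, a representative sequence has to be chosen for the OTU (here, the most abundant member). In the exact case, each sequence simply represents itself.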

As no clustering is performed, there is no special method for picking representative sequences as there used to be with OTUs. They are simply the true sequences representing each ASV.

You are correct: the default setting (consensus) is de novo. See here and here for a bit more discussion about this, if interested.

DoHarris · April 28, 2019, 12:06pm

Thanks Mehrbod_Estaki for your comprehensive response.