How to create a dereplicated sequence reference database for taxonomy classification: case of COI

Nicholas_Bokulich · October 31, 2018, 8:49pm

Great questions, but unfortunately they are out of scope for QIIME 2: they are specifically about creating reference databases, and QIIME 2 was not designed for that use case. I am also not sure there is a good answer. Check out what SILVA does to generate its QIIME 2-compatible database; there may even be scripts somewhere that you can use to replicate the process. They solve that issue by providing several taxonomies, both majority and consensus, to cover the cases you describe.

Correct, that just takes care of sequences.

Yep, it is a difficult problem! QIIME 2 is not made to solve that problem, so we defer to the database experts, e.g., SILVA, Greengenes, UNITE. I recommend following in their footsteps, unless you want to put together some custom code to do this.
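If you do go the custom-code route, here is a rough sketch of one way it could look in plain Python. To be clear, this is not what SILVA or any other database actually runs, and it is not part of QIIME 2. The function names (`dereplicate`, `consensus_taxonomy`, `majority_taxonomy`), the `(id, sequence, taxonomy)` input tuples, the semicolon-delimited lineage format, and the 0.5 majority threshold are all assumptions made for illustration. It groups identical sequences, then assigns each group either the deepest lineage that every member agrees on (consensus) or the lineage supported by more than a chosen fraction of members (majority).

```python
from collections import Counter, defaultdict


def dereplicate(records):
    """Group lineages by identical sequence (exact, case-insensitive match).

    records: iterable of (seq_id, sequence, taxonomy) tuples, where taxonomy
    is a semicolon-delimited lineage (an assumed input format).
    """
    groups = defaultdict(list)
    for _seq_id, sequence, taxonomy in records:
        ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
        groups[sequence.upper()].append(ranks)
    return groups


def consensus_taxonomy(lineages):
    """Keep ranks only as deep as all lineages agree."""
    result = []
    for labels in zip(*lineages):  # stops at the shortest lineage
        if len(set(labels)) != 1:
            break
        result.append(labels[0])
    return result


def majority_taxonomy(lineages, threshold=0.5):
    """Keep ranks while one label is backed by > threshold of all lineages."""
    result = []
    candidates = lineages
    for depth in range(max(len(lin) for lin in lineages)):
        labels = [lin[depth] for lin in candidates if len(lin) > depth]
        if not labels:
            break
        label, count = Counter(labels).most_common(1)[0]
        if count / len(lineages) <= threshold:
            break
        result.append(label)
        # Only lineages that share the accepted prefix vote at the next rank.
        candidates = [
            lin for lin in candidates if len(lin) > depth and lin[depth] == label
        ]
    return result


# Made-up example records, purely for illustration.
records = [
    ("a", "ACGT", "k__Animalia; p__Arthropoda; c__Insecta"),
    ("b", "acgt", "k__Animalia; p__Arthropoda; c__Insecta"),
    ("c", "ACGT", "k__Animalia; p__Arthropoda; c__Arachnida"),
    ("d", "TTGA", "k__Animalia; p__Chordata"),
]

for sequence, lineages in dereplicate(records).items():
    print(sequence)
    print("  consensus:", "; ".join(consensus_taxonomy(lineages)))
    print("  majority: ", "; ".join(majority_taxonomy(lineages)))
```

For the `ACGT` group above, the consensus lineage stops at `p__Arthropoda` because the three records disagree at class level, while the majority lineage continues to `c__Insecta` (two of three records). Real databases have more to deal with (near-identical sequences, empty or ambiguous rank labels, ties), so treat this only as a starting point.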

Incidentally, I have been planning on putting together something that would do precisely what you ask, but I have not gotten around to it.

Sorry I have not really answered your question! Only confirmed that it does not have a trivial answer.