Which version of the UNITE database should I choose?

Recently I have been trying to analyze my ITS amplicon sequence data, and I found something a little confusing about the UNITE database.
UNITE distributes two different types of its fungal database in all formats, not only the QIIME pre-formatted one:

  1. Includes singletons set as RefS (in dynamic files).

  2. Includes global and 97% singletons.

I have read the README documents in UNITE, but they do not explain why two versions are posted.
I know what singleton, RefS, and dynamic files each mean on their own, but I cannot figure out what they mean together. :joy:
Is anyone familiar with UNITE? Which one should I normally choose? :face_with_raised_eyebrow:


May I ask whether you figured this out? I ran into the same problem.

Here is my understanding.
(I am not part of the UNITE dev team, so my understanding may be incomplete.)

Context: the UNITE database is clustered

The UNITE database is distributed at three clustering levels:

  • 99%
  • 97%
  • 'dynamic'

Some clusters in UNITE are chosen manually by the devs and others are included automatically.

refs = a manually designated reference sequence (RefS)
reps = an automatically chosen representative sequence (RepS)

The problem: 'singleton' clusters

Most output clusters represent multiple input reads, but some 'singleton' output clusters represent only one input read!

(Do you want this in your database? If a word is spelled diffffffferently is it wrong or novel? :thinking: )
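
To make "singleton" concrete, here is a toy sketch in Python. The cluster names and reads are made up for illustration and have nothing to do with real UNITE data; the point is only that a singleton is a cluster with exactly one member.

```python
# Hypothetical toy data: cluster id -> the input reads assigned to it.
clusters = {
    "cluster_A": ["read1", "read2", "read3"],
    "cluster_B": ["read4"],            # only one read: a singleton
    "cluster_C": ["read5", "read6"],
}


def split_singletons(clusters):
    """Separate clusters with exactly one member from all the others."""
    singletons = {cid: m for cid, m in clusters.items() if len(m) == 1}
    multi = {cid: m for cid, m in clusters.items() if len(m) > 1}
    return singletons, multi


singletons, multi = split_singletons(clusters)
print(sorted(singletons))  # ['cluster_B']
print(sorted(multi))       # ['cluster_A', 'cluster_C']
```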

The choice: do you want to include singleton clusters?

There are a bunch of automatic RepS clusters that represent a single read, and you may not want to include these 'singleton' clusters.

  1. Includes singletons set as RefS (in dynamic files).
    Singletons have been removed from 99% and 97%
  2. Includes global and 97% singletons.
    Includes all the singletons!
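
If you want to check what a given download actually contains, one option is to tally the labels in the FASTA headers. The sketch below rests on assumptions I have not confirmed with the UNITE team: that headers are pipe-delimited, that the fourth field is a label such as `refs`, `reps`, `refs_singleton`, or `reps_singleton`, and that singletons are marked by a `_singleton` suffix. The example headers and accession numbers are invented, so look at your own file's headers first.

```python
from collections import Counter


def header_label(header):
    """Return the 4th pipe-delimited field of a FASTA header, or None.

    Assumed layout (verify against your file):
    >Name|Accession|SH_code|label|taxonomy
    """
    fields = header.lstrip(">").strip().split("|")
    return fields[3] if len(fields) > 3 else None


def count_labels(fasta_lines):
    """Count how many records carry each label."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            counts[header_label(line)] += 1
    return counts


def is_singleton(label):
    """Assumption: singleton records have a label ending in '_singleton'."""
    return label is not None and label.endswith("_singleton")


# Invented example records, not real UNITE entries.
example = [
    ">Species_a|ACC0001|SH0000001.00FU|refs|k__Fungi;p__Example",
    "ACGTACGT",
    ">Species_b|ACC0002|SH0000002.00FU|reps_singleton|k__Fungi;p__Example",
    "ACGTTTGT",
    ">Species_c|ACC0003|SH0000003.00FU|reps|k__Fungi;p__Example",
    "ACGTAAGT",
]

counts = count_labels(example)
n_singletons = sum(n for label, n in counts.items() if is_singleton(label))
print(dict(counts))
print("singleton records:", n_singletons)
```

For a real file, pass an open file handle instead of the `example` list, e.g. `count_labels(open(path))`, where `path` points to whichever release you downloaded. Comparing the tallies from the two downloads should show which singleton labels each one keeps.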

I would also love the UNITE team to clarify this. :heart: :mushroom:
