Recently I have been trying to analyze my ITS amplicon sequence data, and I found something confusing about the UNITE database.
On the UNITE website, two different versions of the fungal database are distributed in every format, not only the QIIME-formatted release:
Includes singletons set as RefS (in dynamic files).
Includes global and 97% singletons.
I have read the UNITE README documents, but they do not explain why two versions are posted.
I know what singleton, RefS, and dynamic files each mean, but I cannot figure out what they mean together.
Is anyone familiar with UNITE? Which one should I normally choose?
May I ask whether you figured this out? I ran into the same problem.
Here is my understanding.
(I am not part of the UNITE dev team, so my understanding may be incomplete.)
Context: the UNITE database is clustered
The UNITE database is distributed at three clustering levels: 97%, 99%, and dynamic.
For some clusters in UNITE, the sequence that represents the cluster is chosen manually by the devs; for the others it is chosen automatically.
refs = this is a manually designated RefS
(reps = this is an automatically chosen RepS)
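If you want to see how many sequences of each kind are in the file you downloaded, something like the sketch below should work. It assumes the `refs` / `reps` tag appears as its own token in the FASTA header, separated by `|` or `_`; please check this against your own file. The headers in `example` and the function name `count_ref_tags` are made up for illustration and are not real UNITE records or tooling.

```python
from collections import Counter
import re


def count_ref_tags(fasta_lines):
    """Count 'refs' and 'reps' tags found in FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        # Assumed layout: header fields are separated by '|' or '_'.
        tokens = re.split(r"[|_]", line[1:].strip())
        if "refs" in tokens:
            counts["refs"] += 1
        elif "reps" in tokens:
            counts["reps"] += 1
        else:
            counts["untagged"] += 1
    return counts


# Hypothetical headers, invented for this example only.
example = [
    ">SH0000001.00FU_AB000001_refs",
    "ACGT",
    ">SH0000002.00FU_AB000002_reps",
    "ACGT",
    ">Fungi_sp|AB000003|SH0000003.00FU|reps|k__Fungi",
    "ACGT",
]

print(count_ref_tags(example))
```

To run it on a real file, pass an open file handle instead of the list, for example `count_ref_tags(open("your_unite_file.fasta"))`, where the file name is a placeholder.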
The problem: 'singleton' clusters
Most output clusters represent multiple input reads, but some 'singleton' output clusters represent only one input read!
(Do you want this in your database? If a word is spelled diffffffferently, is it wrong or novel?)
The choice: do you want to include singleton clusters?
There are a bunch of automatic RepS clusters that represent a single read, and you may not want to include these 'singleton' clusters.
- Includes singletons set as RefS (in dynamic files).
Singletons have been removed from the 99% and 97% clustering levels.
- Includes global and 97% singletons.
Includes all the singletons!
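To make my reading of the two options concrete, here is a toy model. It is only a sketch of my interpretation, not how UNITE actually builds its releases: I am assuming the first option drops singleton clusters unless they were manually set as RefS, and the second option keeps everything. The `Cluster` class, the cluster names, and the sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    size: int  # number of input sequences the cluster represents
    tag: str   # "refs" (manually designated) or "reps" (automatic)


def filter_clusters(clusters, keep_all_singletons):
    """Return the names of the clusters kept under one of the two options."""
    kept = []
    for c in clusters:
        is_singleton = c.size == 1
        # Keep non-singletons always; keep singletons only if requested
        # or if they were manually designated (assumed behaviour).
        if not is_singleton or keep_all_singletons or c.tag == "refs":
            kept.append(c.name)
    return kept


# Hypothetical clusters, invented for this example only.
clusters = [
    Cluster("A", 12, "reps"),  # ordinary multi-sequence cluster
    Cluster("B", 1, "refs"),   # singleton, manually set as RefS
    Cluster("C", 1, "reps"),   # singleton, automatically chosen
]

print(filter_clusters(clusters, keep_all_singletons=False))  # ['A', 'B']
print(filter_clusters(clusters, keep_all_singletons=True))   # ['A', 'B', 'C']
```

Under this reading, the only difference between the two downloads is cluster "C": the automatic singletons.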
I would also love the UNITE team to clarify this.