Why not use 100% database identity?

SoilRotifer · October 7, 2020, 1:16pm

There are a few good reasons to use 99% clustering for taxonomy / sequence reference databases.

Many reference sequences do not come from high-throughput sequencing, and hence cannot be denoised. Clustering is still a good way to reduce the complexity of the data. We decided to use SILVA NR99 for QIIME2 for the reasons outlined here.
Clustering makes the reference data more practical to use and curate:
- it reduces the memory footprint of taxonomy classifiers
- it makes curation tasks (both manual and automated) on the reference database much easier, as there is quite a bit of redundant information.
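
To give a rough feel for the redundancy point, here is a minimal Python sketch that collapses exact-duplicate sequences in a toy reference set. The `dereplicate` function, the record layout, and the data are all made up for illustration; this is not how SILVA or RESCRIPt do it, and real 99% clustering needs an alignment-based tool rather than exact string matching.

```python
from collections import defaultdict

def dereplicate(records):
    """Group records that share an identical sequence (case-insensitive).

    records: iterable of (seq_id, sequence, taxonomy) tuples.
    This layout is a hypothetical one chosen for the example.
    Returns a dict mapping each unique sequence to the list of
    (seq_id, taxonomy) pairs that carried it.
    """
    groups = defaultdict(list)
    for seq_id, sequence, taxonomy in records:
        groups[sequence.upper()].append((seq_id, taxonomy))
    return dict(groups)

# Toy data, invented for illustration only.
records = [
    ("ref1", "ACGTACGT", "k__Bacteria; g__A"),
    ("ref2", "acgtacgt", "k__Bacteria; g__A"),
    ("ref3", "ACGTACGA", "k__Bacteria; g__B"),
]

unique = dereplicate(records)
print(len(records), "records ->", len(unique), "unique sequences")
```

Even exact matching shrinks this toy set from three records to two; clustering at 99% would also merge near-identical sequences, which is where most of the savings in a large reference database come from.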

If you'd like to make your own 100% reference database from SILVA or other reference databases, give RESCRIPt a try. The preprint recently went online, and the tutorial is here.

-Mike