Hi @Kevin,
There are quite a few reasons to use 99% clustering for taxonomy / sequence reference databases:
- Many reference sequences do not come from high-throughput sequencing, and hence cannot be denoised. Clustering is still a good way to reduce the complexity of the data. We decided to use SILVA NR99 for QIIME2 for the reasons outlined here.
- Clustering makes using and curating the reference data more practical:
  - it reduces the memory footprint of taxonomy classifiers
  - it makes tasks on the reference database (both manual and automated) much easier, because clustering removes quite a bit of redundant information.
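
To make the redundancy point concrete, here is a toy sketch of greedy centroid clustering at an identity threshold. It is only an illustration: it assumes ungapped, equal-length sequences and a simple per-position identity, and it is not how SILVA builds NR99 or how any QIIME 2 plugin clusters sequences. The names `identity`, `greedy_cluster`, and the `refs` records are all hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example only handles equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: dict, threshold: float = 0.99) -> list:
    """Return the IDs kept as centroids.

    Each sequence joins the first existing centroid it matches at
    >= threshold; otherwise it becomes a new centroid.
    """
    centroids = []
    for seq_id, seq in seqs.items():
        for centroid_id in centroids:
            if identity(seq, seqs[centroid_id]) >= threshold:
                break  # redundant at this threshold, so drop it
        else:
            centroids.append(seq_id)
    return centroids


# Hypothetical 200 nt reference records (made up for illustration).
base = "ACGT" * 50
refs = {
    "ref_a": base,
    "ref_b": "T" + base[1:],     # 1 mismatch vs ref_a (99.5% identity)
    "ref_c": "TGCA" + base[4:],  # 4 mismatches vs ref_a (98% identity)
    "ref_d": base,               # exact duplicate of ref_a
}

print(greedy_cluster(refs, threshold=0.99))  # ['ref_a', 'ref_c']
print(greedy_cluster(refs, threshold=1.0))   # ['ref_a', 'ref_b', 'ref_c']
```

Even in this tiny set, the 99% threshold keeps two records where 100% keeps three, and on a real database that difference is what shrinks the classifier's memory footprint.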
If you'd like to make your own 100% reference database from SILVA or other reference databases, give RESCRIPt a try. The preprint recently went online, and the tutorial is here.
-Mike