For taxonomic classifiers, both the Greengenes (gg) and SILVA data sets provide 99% sequences. What does that mean? ASVs are 100%-similarity sequences, so can we classify our ASVs against 99% reference sequences? What is the difference, and the link, between 100% and 99%? Thank you.
Hi @Kevin,
There are several reasons to use 99% clustering for taxonomy / sequence reference databases.
- Many reference sequences do not come from high-throughput sequencing, and hence cannot be denoised. Clustering is still a good way to reduce the complexity of the data. We decided to use SILVA NR99 for QIIME 2 for the reasons outlined here.
- Clustering makes the reference data more practical to use and curate:
  - it reduces the memory footprint of taxonomy classifiers
  - it makes curation tasks (both manual and automated) on the reference database much easier, as there is quite a bit of redundant information.
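To illustrate the two points above, here is a toy Python sketch. It is not how QIIME 2 or RESCRIPt work internally (the QIIME 2 naive Bayes classifier uses k-mer frequencies, not pairwise identity), and the sequences, taxon labels, and function names are all made up for illustration. It only shows that collapsing redundant references shrinks the database, and that an ASV does not need an exact 100% match in the reference to be assigned to its closest taxon.

```python
from collections import defaultdict


def dereplicate(records):
    """Collapse records with identical sequences, keeping every taxon label seen."""
    groups = defaultdict(set)
    for seq, taxon in records:
        groups[seq.upper()].add(taxon)
    return dict(groups)


def percent_identity(a, b):
    """Percent identity between two equal-length, already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical reference records: (sequence, taxon). Two are exact duplicates.
reference = [
    ("ACGTACGTACGTACGTACGT", "g__Alpha"),
    ("ACGTACGTACGTACGTACGT", "g__Alpha"),
    ("ACGTACGTACGTACGTACGA", "g__Alpha"),
    ("TTGTACGTACCTACGAACGT", "g__Beta"),
]

# Redundant entries collapse, so there is less to store and curate.
unique = dereplicate(reference)

# A hypothetical ASV that is not identical to any reference sequence.
asv = "ACGTACGTACGTACGTACGC"

# The closest reference still points at a sensible taxon.
best_seq = max(unique, key=lambda ref: percent_identity(asv, ref))
best_identity = percent_identity(asv, best_seq)

print(f"{len(reference)} records -> {len(unique)} unique sequences")
print(f"best hit: {sorted(unique[best_seq])} at {best_identity:.1f}% identity")
```

In other words, the 100% in "ASVs" describes how your own reads were resolved, while the 99% describes how redundant the reference is allowed to be. The classifier only needs a reference that is similar enough to your ASV, not identical to it.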
If you’d like to make your own 100% reference database from SILVA or other reference databases, give RESCRIPt a try. The preprint recently went online, and the tutorial is here.
-Mike
I had this question as well - this was a really helpful thread for optimizing truncation parameters and explaining biological variation in the targeted variable region. Some of the replies above and below this linked one give more context.
It’ll be helpful. Thanks a lot.
Very useful information, Thanks!