Why not use 100% database identity?

For taxonomic classification, both the gg (Greengenes) and SILVA data sets provide 99% sequences. What does that mean? ASVs are 100%-identity sequences, so can we classify our ASVs against a 99% reference? What are the differences and the relationship between 100% and 99%? Thank you.

Hi @Kevin,

There are several reasons to use 99% clustered sequences for taxonomy / sequence reference databases.

  • Many reference sequences do not come from high-throughput sequencing, and hence cannot be denoised. Clustering is still a good way to reduce the complexity of the data. We decided to use SILVA NR99 for QIIME 2 for the reasons outlined here.
  • Clustering makes the reference data more practical to use and curate:
    • it reduces the memory footprint of taxonomy classifiers
    • it makes curation tasks (both manual and automated) on the reference database much easier, as there is quite a bit of redundant information.
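
To make the clustering idea concrete, here is a rough Python sketch. It is a toy illustration only: the equal-length sequences, the position-by-position identity measure, the greedy clustering, and all names (`mutate`, `identity`, `greedy_cluster`, `ref_a`, etc.) are assumptions made up for this example, not how SILVA NR99 or QIIME 2 actually build or search a reference.

```python
# Toy sketch: equal-length sequences and greedy clustering are simplifying
# assumptions, not the procedure SILVA or QIIME 2 actually uses.

SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate(seq, positions):
    """Return seq with a substitution at each of the given positions."""
    chars = list(seq)
    for i in positions:
        chars[i] = SWAP[chars[i]]
    return "".join(chars)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Each sequence joins the first centroid it matches at >= threshold,
    otherwise it becomes a new centroid."""
    clusters = {}
    for name, seq in seqs.items():
        for centroid in clusters:
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


base = "ACGT" * 25  # 100 bases
refs = {
    "ref_a": base,
    "ref_b": mutate(base, [0]),                 # 1 mismatch vs ref_a
    "ref_c": mutate(base, range(0, 100, 10)),   # 10 mismatches vs ref_a
}

full = greedy_cluster(refs, 1.0)    # "100%" reference: nothing collapses
nr99 = greedy_cluster(refs, 0.99)   # "99%" reference: near-duplicates collapse

# An ASV identical to ref_b is still matched to its cluster representative.
asv = refs["ref_b"]
best = max(nr99, key=lambda name: identity(asv, refs[name]))
best_identity = identity(asv, refs[best])

print(len(full), len(nr99), best, best_identity)
```

In this toy case the 99% set keeps two representatives instead of three, and the ASV (an exact sequence) still finds `ref_a` as its closest reference at 99% identity. The ASV stays a 100% sequence; only the reference was thinned.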

If you’d like to make your own 100% reference database from SILVA or other reference databases, give RESCRIPt a try. The preprint recently went online, and the tutorial is here.

-Mike

@Kevin

I had this question as well. This was a really helpful thread for optimizing truncation parameters and for explaining biological variation in the targeted variable region. Some of the replies above and below the linked one give more context.

It’ll be helpful. Thanks a lot.

Very useful information, Thanks!