Why not use 100% database identity?

Kevin · October 6, 2020, 5:22pm

For taxonomic classifiers, both the gg (Greengenes) and SILVA datasets provide 99% sequences. What does that mean? ASVs are 100%-similarity sequences, so can we classify our ASVs using 99% reference sequences? What is the difference, and the link, between 100% and 99%? Thank you.

SoilRotifer · October 7, 2020, 1:16pm

Hi @Kevin,

There are quite a few reasons to use 99% for taxonomy / sequence reference databases.

Many reference sequences do not come from high-throughput sequencing, and hence cannot be denoised. Clustering is still a good way to reduce the complexity of the data. We decided to use SILVA NR99 for QIIME 2 for the reasons outlined here.
Clustering at 99% makes the reference data more practical to use and curate:
- it reduces the memory footprint of taxonomy classifiers
- it makes tasks (both manual and automated) on the reference database much easier, as there is quite a bit of redundant information.
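
To make the 99% vs. 100% difference concrete, here is a toy sketch of greedy centroid clustering at an identity threshold. It is only an illustration, not how SILVA or Greengenes actually build their releases: the `identity`, `mutate`, and `greedy_cluster` helpers and the `ref1`..`ref4` sequences are all made up, and identity is a simple per-position match on equal-length sequences rather than a real alignment.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mutate(seq: str, positions) -> str:
    """Return a copy of seq with a guaranteed mismatch at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


def greedy_cluster(seqs: dict, threshold: float) -> dict:
    """Assign each sequence to the first centroid it matches at >= threshold.

    Returns a mapping of centroid name -> list of member names.
    """
    clusters = {}
    for name, seq in seqs.items():
        for centroid in clusters:
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            # No existing centroid is close enough: start a new cluster.
            clusters[name] = [name]
    return clusters


base = "ACGT" * 50  # hypothetical 200 nt reference sequence
refs = {
    "ref1": base,
    "ref2": base,                            # exact duplicate of ref1
    "ref3": mutate(base, [10]),              # 1 mismatch   -> 99.5% identical
    "ref4": mutate(base, range(0, 200, 20)), # 10 mismatches -> 95% identical
}

nr99 = greedy_cluster(refs, 0.99)
nr100 = greedy_cluster(refs, 1.0)
print(len(nr99))   # 2 representatives at 99%
print(len(nr100))  # 3 representatives at 100%
```

At 99% the near-identical `ref3` is folded into the `ref1` cluster, while at 100% only the exact duplicate is removed. The ASVs you classify are still compared against whatever representatives remain, so a 99% reference set loses a little resolution in exchange for a smaller, less redundant database.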

If you'd like to make your own 100% reference database from SILVA or other reference databases, give RESCRIPt a try. The preprint recently went online, and the tutorial is here.

-Mike

hsapers · October 7, 2020, 2:00pm

@Kevin

I had this question as well - this was a really helpful thread for optimizing truncation parameters and explaining biological variation in the targeted variable region. Some of the replies above and below this linked one give more context.

Kevin · October 7, 2020, 2:36pm

It'll be helpful. Thanks a lot.

Kevin · October 7, 2020, 2:39pm

Very useful information, Thanks!