The NR99 version of the SILVA fasta files is the de facto standard for use in classification. Is this because NR99 gives better results, or because the resource requirements to process it are more modest? With DADA2 the resolution between sequence variants can be finer than 99%, so I could theoretically get ever-so-slightly better classifications using all of the reference sequences, right? But I'm not sure if there are trade-offs. If anyone in the community has thought through this before, I'd appreciate hearing which version you believe is likely to give the best results and why.
Great question. The main reason is as you surmise here:
Clustering at 99% identity substantially reduces the number of reference sequences, which makes runtime and memory consumption far more reasonable.
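To make the size reduction concrete, here is a minimal sketch of greedy identity clustering, the general idea behind an NR99-style dereplication. This is not SILVA's actual pipeline (which clusters full-length rRNA with dedicated tools); the sequences and the `greedy_cluster` helper are made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.99):
    """Keep a sequence only if no retained representative is >= threshold identical to it."""
    reps = []
    for s in seqs:
        if all(identity(s, r) < threshold for r in reps):
            reps.append(s)
    return reps

# Toy 100-base references: s2 is 99% identical to s1 (clustered out),
# s3 is 98% identical (kept as its own representative).
s1 = "A" * 100
s2 = "T" + "A" * 99
s3 = "TT" + "A" * 98
reps = greedy_cluster([s1, s2, s3])
print(len(reps))  # 2
```

At a real 99% threshold over millions of full-length rRNA sequences, this kind of collapse is what shrinks the database enough to keep classification tractable.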
Probably not. Even though DADA2 resolves query sequences more finely, any classifier that makes a confidence/consensus prediction is still constrained by the spread of near-identical reference hits. If your classification method assigns the single top-hit alignment, then yes, classifying against the full SILVA database might "improve" classification when a perfect match was clustered out of NR99. But the effect on accuracy (as opposed to providing an enticing strain-level hit) is even more questionable and untested…
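A toy sketch may make the distinction clearer. This is not DADA2's actual classifier; the reference entries, the `top_hit` and `consensus_genus` helpers, and the identity cutoff are all invented for illustration. The point is that clustering a strain out of the database changes the top hit but not a consensus-style call.

```python
def identity(a, b):
    """Fraction of matching positions between equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical reference: the full DB has two strains of the same genus
# differing by one base; 99% clustering would keep only one representative.
full_db = {
    "Genus_A;strain_1": "ACGTACGTAC",
    "Genus_A;strain_2": "ACGTACGTAT",  # clustered out in an NR99-style DB
    "Genus_B;strain_1": "ACGTTTTTAC",
}
nr99_db = {k: v for k, v in full_db.items() if k != "Genus_A;strain_2"}

query = "ACGTACGTAT"  # exact match to the clustered-out strain

def top_hit(db, q):
    """Label of the single best-matching reference."""
    return max(db, key=lambda k: identity(db[k], q))

def consensus_genus(db, q, threshold=0.9):
    """Consensus genus over all references within `threshold` identity."""
    hits = [k.split(";")[0] for k in db if identity(db[k], q) >= threshold]
    return hits[0] if len(set(hits)) == 1 else "unclassified"

# The top-hit strain changes between databases...
print(top_hit(full_db, query))   # Genus_A;strain_2 (perfect match)
print(top_hit(nr99_db, query))   # Genus_A;strain_1 (99% match)
# ...but the consensus genus call is identical either way.
print(consensus_genus(full_db, query))  # Genus_A
print(consensus_genus(nr99_db, query))  # Genus_A
```

So a top-hit method can surface a strain-level label from the full database that NR99 lacks, while a consensus-based call at the ranks you would actually trust comes out the same.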
Bottom line: I agree with you that NR99 is a bit of a compromise. But in practice the alternative is not going to appreciably improve accuracy, while substantially increasing computational cost. The juice isn't worth the squeeze…