Hello qiime2 community!
This topic aims to clarify some questions about choosing the thresholds for perc-identity and perc-query-aligned when using exclude-seqs.
Following the QIIME 2 documentation, I always use the default (0.97). My reasoning is that, as long as I do not lose so many feature sequences that my downstream analyses (diversity, abundance) could be biased, it is reasonable to use a stringent threshold when comparing my sequences against a SILVA reference database.
However, a reviewer of my paper raised a question that I am not sure how to answer:
Why did you remove sequences that were < 97% similar to known sequences? This would exclude all taxa whose species is not yet represented in the databases. However, even an unknown bacterial species could be identified to family or genus level.
So I visited this post and saw that @Nicholas_Bokulich said:
How dissimilar are the sequences that you are trying to exclude? There are various publications out there describing the similarity of 16S rRNA gene sequences within different taxonomic groups, which you could consult to answer this.
Perhaps one of those publications would be a good reference to cite for my reviewer (could you give me some recommendations?). And most importantly, is my strategy for choosing the perc-identity and perc-query-aligned thresholds correct? Or should I rely on other observations?
Thanks in advance!