exclude-seqs: How to choose the best threshold for perc-identity and perc-query-aligned

Hello qiime2 community!

I'd like to clarify some questions about choosing the thresholds for perc-identity and perc-query-aligned when using exclude-seqs.

Following the QIIME 2 documentation, I always use the default (0.97). My reasoning is that, as long as I don't lose a significant number of feature sequences (which could bias my downstream analyses of diversity and abundance), it's reasonable to use the strictest threshold when comparing my sequences against a SILVA reference database.

However, a reviewer of my paper raised a question that I'm not sure how to answer:

Why did you remove sequences that were < 97% similar to known sequences? This would exclude all taxa whose species is not yet represented in the databases. However, even an unknown bacterial species could be identified to family or genus level.

So I visited this post and saw that @Nicholas_Bokulich said:

How dissimilar are the sequences that you are trying to exclude? There are various publications out there describing the similarity of 16S rRNA gene sequences within different taxonomic groups, which you could consult to answer this.

Perhaps one of those publications would be a good reference to cite for my reviewer (could you give me some recommendations?). And most importantly, is my strategy for choosing the perc-identity and perc-query-aligned thresholds correct? Or should I rely on other observations?

Thanks in advance!

Hi @joaomiranda,
I must say, I agree with this reviewer: 97% similarity is very strict for inclusion. It is the default for that action because, well, something needs to be the default, and 97% is more appropriate for removing specific non-target sequences (e.g., filtering out sequences that hit host sequences). But as mentioned in that other post, 97% is too stringent for a broad filter like this, where you are aiming to remove sequences that are too dissimilar from the reference sequences. The exception is if your goal really is to put on "blinders" and focus only on sequences that very closely match previously described sequences (and for sure there are cases for this as well).

To get a quantitative reference that you could use to set a filtering threshold, you could check out the GTDB paper... it is a specific case that shows how (in their database) sequence similarity is associated with different bacterial taxonomic ranks.
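
To make the effect of the two thresholds concrete, here is a minimal Python sketch of how a stricter or looser filter changes which sequences count as hits. It is only an illustration, not the plugin's code: the `Hit` class, the helper names, the toy alignment numbers, and the relaxed value of 0.80 are all made up for this example, and I am assuming that identity is matches divided by alignment length and that the aligned fraction is alignment length divided by query length.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best alignment of one query against the reference (toy data)."""
    query_id: str
    matches: int           # identical positions in the alignment
    alignment_length: int  # aligned positions, including mismatches
    query_length: int      # full length of the query sequence


def is_hit(hit, perc_identity, perc_query_aligned):
    """Return True if the alignment satisfies both thresholds (assumed definitions)."""
    identity = hit.matches / hit.alignment_length
    aligned_fraction = hit.alignment_length / hit.query_length
    return identity >= perc_identity and aligned_fraction >= perc_query_aligned


def split_hits(hits, perc_identity=0.97, perc_query_aligned=0.97):
    """Split query IDs into (hits, misses) for the given thresholds."""
    kept, dropped = [], []
    for hit in hits:
        target = kept if is_hit(hit, perc_identity, perc_query_aligned) else dropped
        target.append(hit.query_id)
    return kept, dropped


# Hypothetical alignments for four 253 nt queries.
hits = [
    Hit("seq_a", 250, 253, 253),  # ~98.8% identity, fully aligned
    Hit("seq_b", 235, 253, 253),  # ~92.9% identity, fully aligned
    Hit("seq_c", 215, 253, 253),  # ~85.0% identity, fully aligned
    Hit("seq_d", 118, 120, 253),  # ~98.3% identity, but under half the query aligned
]

strict = split_hits(hits)  # both thresholds at 0.97
relaxed = split_hits(hits, perc_identity=0.80, perc_query_aligned=0.80)

print("strict  kept:", strict[0], "dropped:", strict[1])
print("relaxed kept:", relaxed[0], "dropped:", relaxed[1])
```

With these toy numbers, the strict filter keeps only the near-exact match, while the relaxed one also keeps the more divergent full-length sequences and still drops the short partial alignment. Which value is appropriate for real data should come from a quantitative reference, not from this example.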

Good luck!


Thank you so much for your insights!
