exclude-seqs: Exclude sequences by alignment

Hi @Melissa_Soh,

What are these ASVs? Is it biologically reasonable to exclude them? For example, can you confirm that they are contaminants?

classify-sklearn does not perform alignment. It is a naive Bayes classifier based on k-mer frequencies, and the 0.7 is a confidence score, not a percent identity, so it is not relevant in this context.

The 0.97 percent-identity default is based on the scenario where you are filtering out something really specific, e.g., looking for matches to a set of host sequences. I agree it is too strict for inclusion criteria in your case.

The perc-query-aligned and perc-identity values should be chosen based on evidence and expectations. How dissimilar are the sequences that you are trying to exclude? There are various publications describing the similarity of 16S rRNA gene sequences within different taxonomic groups, which you could consult to answer this.

A perc-query-aligned of 0.10 is very low and does not make sense unless you expect the query to be a very poor match to the reference for some reason (unlikely for 16S rRNA genes).
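To make the two thresholds concrete, here is a rough Python sketch of how percent identity and the fraction of the query aligned can be computed from a single gapped pairwise alignment. It is only an illustration: the function names `alignment_stats` and `is_hit` are made up for this post, and the arithmetic is a simplification, not the way BLAST, VSEARCH, or exclude-seqs compute these values internally.

```python
def alignment_stats(aligned_query, aligned_ref, query_length):
    """Return (percent identity, fraction of query aligned) for one gapped alignment.

    Hypothetical helper for illustration only; real aligners may define
    these quantities slightly differently.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned strings must have the same length")
    if not aligned_query or query_length <= 0:
        raise ValueError("alignment and query must be non-empty")
    # Alignment columns where both sequences carry the same base.
    matches = sum(q == r and q != "-" for q, r in zip(aligned_query, aligned_ref))
    # Query bases covered by the alignment (gap characters are not bases).
    query_bases = sum(q != "-" for q in aligned_query)
    perc_identity = matches / len(aligned_query)
    perc_query_aligned = query_bases / query_length
    return perc_identity, perc_query_aligned


def is_hit(aligned_query, aligned_ref, query_length, min_identity, min_query_aligned):
    """A query counts as a hit only if it passes both thresholds."""
    identity, coverage = alignment_stats(aligned_query, aligned_ref, query_length)
    return identity >= min_identity and coverage >= min_query_aligned


# A 20 nt query where only 10 nt align locally, but those 10 match perfectly.
identity, coverage = alignment_stats("ACGTACGTAC", "ACGTACGTAC", 20)
print(identity, coverage)
print(is_hit("ACGTACGTAC", "ACGTACGTAC", 20, 0.97, 0.10))
print(is_hit("ACGTACGTAC", "ACGTACGTAC", 20, 0.97, 0.97))
```

In this toy case only half of the query aligns, yet it still counts as a hit when the query-aligned threshold is 0.10. That is the sense in which such a low value is permissive: a short local match is enough.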

Any chance the "bad ASVs" are host mitochondrial sequences or other non-target DNA? Instead of using q2-quality-control, you could just use q2-taxa and exclude based on taxonomic label (e.g., classification to mitochondria).
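As a rough sketch of the idea behind label-based exclusion, the Python below drops every feature whose taxonomy string contains an unwanted term. Everything here is an assumption for illustration: `drop_by_taxon` is a hypothetical helper, the feature IDs and taxonomy strings are invented, and this is not q2-taxa's implementation.

```python
def drop_by_taxon(taxonomy, exclude_terms):
    """Keep only features whose taxonomy label contains none of the terms.

    Matching is a case-insensitive substring test (an assumption of this sketch).
    """
    lowered = [term.lower() for term in exclude_terms]
    return {
        feature_id: label
        for feature_id, label in taxonomy.items()
        if not any(term in label.lower() for term in lowered)
    }


# Invented feature IDs and labels, for illustration only.
taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes; c__Bacilli",
    "asv2": "d__Bacteria; p__Proteobacteria; o__Rickettsiales; f__Mitochondria",
    "asv3": "d__Bacteria; p__Cyanobacteria; o__Chloroplast",
}

kept = drop_by_taxon(taxonomy, ["mitochondria"])
print(sorted(kept))
```

The advantage of this route is that no alignment thresholds need to be chosen at all; the decision rests on the classification you already have.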

Good luck!
