Building a database in order to train a classifier

I have amplicon datasets generated for a single-copy gene (460 bp) within a specific gut-related taxon (genus level) that I want to look through for species and strain variants. I have some questions about building the database to do this. Mostly, I'm wondering what range of sequence similarities I should include in the alignment. For example, I expect most of the sequences to be >90% similar to each other, with species clusters at 95-97% similarity, and I'll bet the strain variants I want are >99% similar. Should I include a bunch of highly similar (>99%) sequences within each of the expected species? What about outgroups, since there may be stuff in there I can't account for? Are there disadvantages to having this kind of discontinuous similarity in the database? Thanks!
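To make the similarity ranges above concrete, here is a rough sketch of how the pairwise identities in a candidate reference alignment could be tallied by tier. Everything in it is an assumption for illustration: the function names are made up, the cutoffs (99, 95 and 90%) are a simplification of the ranges I described, and identity is computed naively over columns where neither sequence has a gap, which may not match how a given aligner or classifier defines it.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two rows taken from the same alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    # Compare only columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def similarity_tier(identity: float) -> str:
    """Map an identity value to a tier (cutoffs are assumed, not established)."""
    if identity >= 99.0:
        return "strain"
    if identity >= 95.0:
        return "species"
    if identity >= 90.0:
        return "genus"
    return "outgroup"


def tier_counts(aligned: dict[str, str]) -> dict[str, int]:
    """Count how many sequence pairs in the alignment fall into each tier."""
    counts = {"strain": 0, "species": 0, "genus": 0, "outgroup": 0}
    for s1, s2 in combinations(aligned.values(), 2):
        counts[similarity_tier(percent_identity(s1, s2))] += 1
    return counts
```

Running `tier_counts` on a dictionary of aligned reference sequences (ID to aligned string) would show whether the database is dominated by one tier or has large gaps between tiers, which is the "discontinuous similarity" I'm asking about.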