Hi @nounou,
Thanks for following up on this from the previous discussion.
Couldn't agree more, it is tough to find available tools for this, so thank you very much for sharing your code! I suspect spacers are going to become more popular, and the need for this kind of tool will be greater than ever. If I may recommend one thing: if you get some time, it may be useful for newer users to have an example workflow in the README. This would certainly increase the reach of your tool and may help detect any potential issues/bugs as well.
What is the reasoning behind discarding those reads? I'm also curious as to why some of these reads don't already have primers; I can't recall from our previous discussion. Why not keep all the reads and trim the primers on the reads that have them? Keeping universal primers in your reads doesn't really add resolution to the data, and in fact people have reported that classification is more accurate without them.
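To illustrate the "keep everything, trim where present" idea, here is a minimal Python sketch. The primer sequence and function names are hypothetical placeholders for illustration, not part of your tool; in practice a dedicated trimmer such as cutadapt would handle this.

```python
# Sketch: trim a 5' primer when it is present, but keep the read either way,
# so no reads are discarded just because the primer is absent.
# The primer below (515F-style) is only an illustrative example.
PRIMER = "GTGYCAGCMGCCGCGGTAA"

# IUPAC ambiguity codes, so degenerate primer bases match correctly.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_at_start(read: str, primer: str = PRIMER) -> bool:
    """True if the read begins with the primer (IUPAC-aware exact match)."""
    if len(read) < len(primer):
        return False
    return all(base in IUPAC[p] for p, base in zip(primer, read))

def trim_primer(read: str, primer: str = PRIMER) -> str:
    """Remove the 5' primer if present; otherwise return the read unchanged."""
    return read[len(primer):] if primer_at_start(read, primer) else read
```

A read that starts with the primer comes back with it trimmed, while a primer-free read passes through untouched, so the full read set is retained.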
Great! That is to be expected with shorter reads. And glad you got better classification!
That really depends on your samples. For human/mouse fecal samples, that number certainly seems on the high end of the spectrum, but perhaps, as you mentioned, that is normal for your insect data. What is surprising, though, is that after removing these HS you actually end up with even more rep-seqs than before you removed them (you mentioned 27K in the previous post). This doesn't sit right with me, as I would expect your # of rep-seqs to drop by a factor roughly equal to the number of different HS you had, so I'm not sure why this number actually went up. Do most of your reads get taxonomic assignments? What portion of your reads are unassigned, unknown, or only assigned at the Kingdom/Domain level?
Lastly, have you confirmed your tool is working properly and as expected? For example, what happens if the observed HS sequence is off from the expected one by a single nt?
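As a concrete example of the kind of sanity check I mean, here is a small Python sketch that tests whether spacer detection still works when one nt differs. The spacer sequences and the mismatch-tolerant matcher are hypothetical stand-ins, not your tool's actual logic:

```python
# Sketch of a sanity check for heterogeneity-spacer (HS) detection:
# does matching still succeed when the observed spacer is off by one nt?
# The spacer sequences below are hypothetical placeholders.

def hamming(a: str, b: str) -> int:
    """Count mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def find_spacer(read: str, spacers: list[str], max_mismatch: int = 0):
    """Return the first spacer whose read-prefix match has at most
    max_mismatch mismatches, or None if nothing matches."""
    for sp in spacers:
        if len(read) >= len(sp) and hamming(read[: len(sp)], sp) <= max_mismatch:
            return sp
    return None
```

With max_mismatch=0 a single-nt error makes the read fall through unmatched; allowing one mismatch recovers it. Running both settings over a few deliberately mutated reads would tell you quickly how your tool behaves in that edge case.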