I’m looking at the mismatch_histogram.png output of the evaluate-seqs function and wondering how the y-axis counts are binned.
My guess is that the x-axis, labeled “distance to nearest expected feature”, represents the percent identity of the alignment between query and subject. If that’s correct, it isn’t necessarily a useful measure on its own, because the histogram can include sequences with 100% identity but an ambiguous coverage percentage. It seems like this part of the code indicates that’s what’s going on, but apologies if I’m overlooking something.
I wonder if the plot I’m looking for is more of a kernel density plot with the x and y axes representing the coverage and identity, with the shade representing the counts. A bit more complicated, but it’ll avoid lumping in a bunch of query sequences that may have few mismatches but align over just a fraction of the subject reference sequence. I’m not sure if pandas does that 2d kernel density plot, but it looks like their hexbin plot will do the same kind of thing. I have seen the 2d kernel density plot within the seaborn version, for what it’s worth.
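To make the idea concrete, here is a minimal sketch of the hexbin version using pandas. It assumes the per-query identity and coverage values have already been collected into a table. The `hits` DataFrame, its column names, and the numbers in it are made up for illustration and are not actual evaluate-seqs output.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is needed
import pandas as pd

# Hypothetical alignment results: one row per query sequence.
# These values are invented for illustration only.
hits = pd.DataFrame({
    "percent_coverage": [100.0, 100.0, 55.0, 100.0, 40.0, 100.0, 72.0, 100.0],
    "percent_identity": [100.0, 100.0, 100.0, 99.4, 100.0, 98.9, 100.0, 97.2],
})

# 2D binning: the shade of each hexagon is the number of queries in that bin.
ax = hits.plot.hexbin(
    x="percent_coverage",
    y="percent_identity",
    gridsize=15,
    cmap="viridis",
)
ax.set_xlabel("coverage (%)")
ax.set_ylabel("identity (%)")
ax.figure.savefig("coverage_identity_hexbin.png")

# A 1D identity histogram would lump all of these into one bar...
perfect_identity = hits[hits["percent_identity"] == 100.0]
# ...while adding the coverage axis separates out the full-length matches.
full_length = perfect_identity[perfect_identity["percent_coverage"] == 100.0]
print(len(perfect_identity), "queries at 100% identity,",
      len(full_length), "of them at full coverage")
```

With real data, the two counts printed at the end are the gap I’m worried about: everything in the first group lands in the same histogram bar, but only the second group actually spans the whole reference.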
Here’s one way I was thinking about these same data. Notice how including both coverage and identity not only increases the number of bins, but also highlights that several sequences at my target amplicon length (180 or 181 bp) have 100% identity (I highlighted these in red), while many others fall short of full coverage.
Thanks for the help!