Quality-control evaluate-seqs histogram

devonorourke · January 15, 2019, 2:58pm

I'm looking at the mismatch_histogram.png output of the evaluate-seqs function and wondering how the y-axis counts are binned.

My guess is that the x-axis, labeled "distance to nearest expected feature", represents the percent identity of the alignment between query and subject. However, if this is correct, it isn't necessarily a useful measure, because it can include sequences with 100% identity but an unknown coverage percentage. It seems like this part of the code indicates that's what's going on, but apologies if I'm overlooking something.

I wonder if the plot I'm looking for is more of a 2D kernel density plot, with the x and y axes representing coverage and identity and the shading representing the counts. It's a bit more complicated, but it would avoid lumping in a bunch of query sequences that have few mismatches but align over just a fraction of the reference sequence. I'm not sure whether pandas offers a 2D kernel density plot, but it looks like its hexbin plot does the same kind of thing. I have seen a 2D kernel density plot in seaborn, for what it's worth.
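For anyone wanting to try this, here is a rough sketch of the hexbin version. It assumes BLAST-style fields per hit (pident, length, slen), defines coverage as alignment length over reference length, and uses made-up values; none of these names come from the evaluate-seqs code itself, and the helper functions are hypothetical.

```python
def coverage_and_identity(hits):
    """Return (coverage %, identity %) pairs from BLAST-like hit dicts.

    Assumed keys (hypothetical, modeled on BLAST tabular output):
      pident - percent identity of the alignment
      length - alignment length
      slen   - length of the reference (subject) sequence
    """
    pairs = []
    for hit in hits:
        coverage = 100.0 * hit["length"] / hit["slen"]
        pairs.append((coverage, hit["pident"]))
    return pairs


def plot_hexbin(pairs, out_path, gridsize=20):
    """Draw a coverage-vs-identity hexbin plot and save it to out_path."""
    # Imported lazily so the helper above works without plotting libraries.
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import pandas as pd

    df = pd.DataFrame(pairs, columns=["coverage", "identity"])
    ax = df.plot.hexbin(x="coverage", y="identity",
                        gridsize=gridsize, cmap="viridis")
    ax.set_xlabel("coverage of reference (%)")
    ax.set_ylabel("identity (%)")
    ax.figure.savefig(out_path)


# Made-up hits: the first two share 100% identity but differ in coverage.
hits = [
    {"pident": 100.0, "length": 181, "slen": 181},
    {"pident": 100.0, "length": 90, "slen": 181},
    {"pident": 97.2, "length": 181, "slen": 181},
]
pairs = coverage_and_identity(hits)
```

Calling plot_hexbin(pairs, "hexbin.png") would then write the figure. With real data, the first two hits above would land in the same bar of an identity-only histogram but in different hexagons here.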

Here's one way I was thinking about these same data. Notice how including both coverage and identity increases the number of bins, and also highlights the fact that there are several sequences at my target amplicon length (180 or 181 bp) with 100% identity (I highlighted these in red), but also many others that fall short of full coverage.

Thanks for the help!

Nicholas_Bokulich · January 15, 2019, 3:19pm

Yes, you are correct. This visualizer was designed more or less on the assumption that all reads are the same length.

Care to contribute? It looks like you have already put together some code to do this, and it would be a good advancement for this visualizer!

devonorourke · January 15, 2019, 8:29pm

Sounds like my first Python challenge @Nicholas_Bokulich!

Hoping to take up that challenge sometime this spring, but no promises.

I wanted to mention that I thought a little harder about the effect of plotting percent identity against percent coverage (or just the alignment length). One thing that became clear to me is that the binning approach I posted earlier is fine when you want to ask "what are the most common coverage/identity values", but it can be very misleading: "common" in this case doesn't take into account the read depth of each ASV observed at that coverage/identity.
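A toy example makes the difference concrete. The ASVs and read counts below are invented for illustration; the point is only that counting ASVs per bin and summing reads per bin can rank the bins differently.

```python
from collections import Counter

# Hypothetical ASVs as (coverage %, identity %, read count). Values are made up.
asvs = [
    (100, 100, 12000),
    (100, 100, 9500),
    (50, 100, 3),
    (50, 100, 2),
    (50, 100, 1),
    (100, 97, 4),
]

asv_counts = Counter()   # number of distinct ASVs in each (coverage, identity) bin
read_totals = Counter()  # number of reads those ASVs account for

for coverage, identity, reads in asvs:
    bin_key = (coverage, identity)
    asv_counts[bin_key] += 1
    read_totals[bin_key] += reads

# Print each bin with its ASV count and its total read depth.
for bin_key in sorted(asv_counts):
    print(bin_key, asv_counts[bin_key], read_totals[bin_key])
```

In this made-up set, the half-coverage bin holds the most ASVs, while nearly all of the reads sit in the full-coverage, full-identity bin.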

So I built another plot. This one doesn't bin ASVs at all; rather, each point is a distinct ASV:

I think it's a helpful way of visualizing the fact that the sequences which are basically exact matches are the only ones with large numbers of reads. Maybe there's another way of binning all of this, but I can't see how to fit four dimensions in a histogram.
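A minimal sketch of that kind of per-ASV plot, assuming each ASV comes with a coverage, an identity, and a read count. Encoding read depth as marker size on a log scale is an assumption made here for illustration, not necessarily how the original plot was drawn, and the function names are hypothetical.

```python
import math


def scatter_points(asvs):
    """Split (coverage, identity, reads) tuples into x, y and marker sizes.

    Read counts must be >= 1. Marker area grows with log10(reads) so that a
    few very abundant ASVs do not swamp the rest of the plot.
    """
    xs = [coverage for coverage, _, _ in asvs]
    ys = [identity for _, identity, _ in asvs]
    sizes = [20 * (1 + math.log10(reads)) for _, _, reads in asvs]
    return xs, ys, sizes


def plot_asvs(asvs, out_path):
    """Draw one point per ASV (no binning) and save the figure to out_path."""
    # Imported lazily so scatter_points works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    xs, ys, sizes = scatter_points(asvs)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, s=sizes, alpha=0.6)
    ax.set_xlabel("coverage of reference (%)")
    ax.set_ylabel("identity (%)")
    fig.savefig(out_path)
    plt.close(fig)
```

Calling plot_asvs([(100, 100, 12000), (50, 100, 3)], "asv_scatter.png") would draw two points at the same identity, with the abundant full-coverage ASV shown as the larger marker.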

Nicholas_Bokulich · January 15, 2019, 9:14pm

Thanks @devonorourke! That looks much more interesting than the histograms. I have opened an issue so that we can add a plot like this to that visualization. If you get a chance this spring, please do grab that issue!

system · February 16, 2019, 3:14am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.