Here is the file that gave me my quality plots: frog2.qzv (289.9 KB)
I was wondering if someone could provide insight into interpretation and what the features of this interactive plot mean (what are the most important features to pay attention to?). Also, when zooming in on a certain section, the score plots are differentiated into two different colors, blue or pink (see below).
A box plot of the quality score distribution is shown for each position in your input sequences. Since it could take a while to compute these distributions from all of your sequence data (e.g. millions of reads), a subset of your reads is selected randomly (without replacement), and the quality scores of those subsampled sequences are used to generate the box plots. By default, 10,000 sequences are subsampled (you can control that number with --p-n on the demux summarize command). Because of this random subsampling, rerunning demux summarize on the same sequence data will give you slightly different plots each time.
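To make the subsampling step concrete, here is a minimal sketch in plain Python. It is not the q2-demux implementation; the function name `subsample_reads`, the `seed` argument, and the read IDs are made up for illustration. It only shows what "randomly, without replacement" means and why two runs can give different subsets.

```python
import random


def subsample_reads(read_ids, n=10_000, seed=None):
    """Pick up to n reads at random, without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    if n >= len(read_ids):
        # Fewer reads than requested: keep everything.
        return list(read_ids)
    return rng.sample(read_ids, n)


# Hypothetical read IDs standing in for a large demultiplexed data set.
reads = [f"read{i}" for i in range(50_000)]

first = subsample_reads(reads)
second = subsample_reads(reads)

# Each run keeps 10,000 distinct reads, but usually not the same ones,
# which is why the box plots shift slightly between runs.
print(len(first), len(set(first)))
print(first == second)
```

Passing the same `seed` twice in this sketch returns the same subset; whether demux summarize exposes anything similar is not something I am claiming here.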
When you hover the mouse over a box plot for a given base position, the box plot’s data is shown in a table below the interactive plot as a parametric seven-number summary. These values describe the distribution of quality scores at that position in your subsampled sequences.
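As a rough illustration of that table, the sketch below computes a seven-number summary for the quality scores at one position. I am assuming the usual definition of the parametric seven-number summary (2nd, 9th, 25th, 50th, 75th, 91st, and 98th percentiles) and simple linear interpolation; the plot's own code may interpolate differently, and the function names are hypothetical.

```python
def percentile(sorted_scores, p):
    """Linearly interpolated p-th percentile of an already sorted list."""
    if not sorted_scores:
        raise ValueError("no quality scores at this position")
    k = (len(sorted_scores) - 1) * p / 100
    lower = int(k)
    upper = min(lower + 1, len(sorted_scores) - 1)
    spread = sorted_scores[upper] - sorted_scores[lower]
    return sorted_scores[lower] + spread * (k - lower)


def seven_number_summary(scores):
    """Summarize one position's quality scores (assumed percentile set)."""
    ordered = sorted(scores)
    return {f"{p}%": percentile(ordered, p) for p in (2, 9, 25, 50, 75, 91, 98)}


# Hypothetical quality scores observed at a single base position.
scores_at_position = [38, 37, 38, 36, 30, 38, 25, 37, 38, 34]
for label, value in seven_number_summary(scores_at_position).items():
    print(label, value)
```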
These interactive plots can be used to determine if there is a drop in quality at some point in your sequences, which (for example) can help you with choosing truncation and trimming parameters when using DADA2.
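One simple way to turn that into a number is sketched below. This is only a heuristic of mine, not something the plot or DADA2 does for you: it takes the per-position medians (the 50% row from the table) and returns how many leading positions stay at or above a threshold. The threshold of 20 and the function name are assumptions for illustration. The result is the kind of value you might then try for --p-trunc-len (or --p-trunc-len-f / --p-trunc-len-r with paired-end reads) after checking the plot by eye.

```python
def suggest_truncation_length(medians, min_quality=20):
    """Return the number of leading positions whose median quality
    stays at or above min_quality (hypothetical heuristic)."""
    for index, median in enumerate(medians):
        if median < min_quality:
            # index is 0-based, so it equals the count of positions kept.
            return index
    return len(medians)


# Hypothetical medians read off the plot, one per base position.
medians = [38, 38, 37, 30, 25, 19, 18]
print(suggest_truncation_length(medians))
```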
Red box plots (and the associated red warning text) should only appear when your input sequences have different lengths. This is typically not the case with Illumina data. The warning and red box plots indicate that not all sequences’ quality scores were included at that position because some of the sequences were shorter. Thus, those red box plots don’t represent as many quality scores as the blue box plots (which include quality scores from all sequences at that position).
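The coloring rule can be sketched in a few lines. This is my reading of the behavior described above rather than the plot's actual code, and `box_plot_colors` is a made-up name: a position is blue when every subsampled sequence reaches it, and red as soon as at least one sequence is too short to contribute a quality score there.

```python
def box_plot_colors(lengths):
    """Return one color per base position (1-based) for the given read lengths."""
    total = len(lengths)
    colors = []
    for position in range(1, max(lengths) + 1):
        # Count the reads long enough to have a quality score here.
        covered = sum(1 for length in lengths if length >= position)
        colors.append("blue" if covered == total else "red")
    return colors


# Hypothetical subsample: two reads of 150bp and one of 149bp.
colors = box_plot_colors([150, 150, 149])
print(colors[148], colors[149])  # positions 149 and 150
```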
I think you may have found a bug related to this coloring/warning. With your forward reads, the plot is saying that there is at least one sequence that is 40bp, which is shorter than many of the other sequences. I would expect the red box plots to show up at position 41, since that would be the point where data starts getting dropped from box plots due to sequence length. However, this warning is showing up at position 150, which makes no sense. The reverse reads appear to be functioning correctly.
Could you provide me with your demultiplexed sequences? That would be the .qza file with semantic type SampleData[PairedEndSequencesWithQuality] that you used as input to demux summarize. That’ll let me reproduce and debug this issue locally. Thanks!
Ignore my request for your data – @ebolyen figured out what’s going on. This is definitely a bug and will be fixed in the next QIIME 2 release. We’ll follow up here when that happens!
TL;DR: The “minimum sequence length observed during subsampling (N bases)” that appears in the red warning text below the interactive plot is incorrect. The number being reported is the global minimum sequence length, but the subsampled minimum sequence length should be reported here. The way that you interpret these plots (blue vs. red) is the same though, and the cutoff between blue and red plots is correct. So this is a pretty minor bug but definitely makes things confusing when interpreting these warnings.
What’s happening is that the shortest sequence in your data is 40bp. However, only 10,000 of your sequences are randomly subsampled, and only their quality scores are plotted. The subsampled sequences (by chance) did not include that really short (40bp) sequence, so the box plots include quality scores from all subsampled sequences up to position 149, and the subsampled sequence lengths only begin to differ at position 150. Thus, the interactive plot is correctly warning (and coloring the box plots in red) at position 150, because that’s the point where some of the subsampled sequences are longer than others. The bug is that 40bp (the global minimum) is being reported in the warning text, when it should be reporting the subsampled minimum of 149bp.
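Here is a small sketch of the mismatch, using made-up read lengths that mirror your case. The function and dictionary keys are hypothetical; the point is only that the warning text was using the minimum over all reads, while the red/blue cutoff follows the minimum over the subsampled reads.

```python
def describe_warning(all_lengths, subsampled_lengths):
    """Compare the length reported by the bug with the one that matches the plot."""
    global_min = min(all_lengths)
    subsampled_min = min(subsampled_lengths)
    if subsampled_min < max(subsampled_lengths):
        # Red box plots begin right after the shortest subsampled read ends.
        first_red = subsampled_min + 1
    else:
        first_red = None  # all subsampled reads have the same length
    return {
        "reported_by_bug": global_min,
        "should_report": subsampled_min,
        "first_red_position": first_red,
    }


# Hypothetical data: one 40bp read exists, but the subsample missed it.
all_lengths = [40] + [149] * 10 + [150] * 100
subsampled_lengths = [149] * 2 + [150] * 20
result = describe_warning(all_lengths, subsampled_lengths)
print(result)
```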
Let me know if you have any more questions about interpreting these plots!