I'm running QIIME2 2019.1 and I am using demux summarize to look at the quality of my reads.
I have paired-end reads and am getting the warning that position 297 is greater than the minimum sequence length observed during subsampling (296 bases) for my forward reads, and the same warning at position 33 for my reverse reads.
I'm not really worried about the forward reads as 297 is almost the entire read length, but 33 bases is extremely short for the reverse read.
After reading this post I saw that I should look at the demultiplexed sequence length summary to verify the length of the sequences. Unlike in that post, my demux length summary shows that the low end of the read length distribution for my reverse reads is 298 nt, not the 32 bases the graph was showing me.
I'm wondering which metric is accurate, the graph or the demultiplexed sequence length summary table?
These are the commands I ran to get to this point:
The key thing to keep in mind is that the length summary is still a distribution. You might only have one read that is 32 nts long, and if that were the case, I wouldn't expect it to show up in the 7-number summary, right? 1 read out of the 10,000 subsampled reads is 0.01%, which is << 2%, the lowest percentile indicated in the summary.
With the comments above in mind, would you agree that the answer is that both are accurate? The thing to keep in mind, I guess, is that they are both incomplete views of the same data, since they are both summaries of that data.
Okay, with all that in mind: since your 7-number summary of read lengths for the reverse reads is 298-300 nts (a nice succinct range!), personally I wouldn't worry about the 33 nts warning in the box plot above, since this is clearly only the case for very few reads (fewer than 2%, eh?)
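To make that arithmetic concrete, here is a minimal Python sketch (this is not QIIME 2's own code). It assumes a made-up sample of 10,000 reverse read lengths containing a single 32 nt read, and it assumes the seven-number summary reports the 2nd, 9th, 25th, 50th, 75th, 91st and 98th percentiles. The `percentile` helper is a hypothetical nearest-rank implementation, so the exact interpolation may differ from what the visualization does.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p in 1..100 (hypothetical helper)."""
    n = len(sorted_values)
    rank = max((p * n + 99) // 100, 1)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


# Made-up data: one 32 nt read among 10,000 subsampled reverse reads.
lengths = [32] + [298] * 3333 + [299] * 3333 + [300] * 3333
ordered = sorted(lengths)

# Assumed percentiles of the seven-number summary.
summary = {p: percentile(ordered, p) for p in (2, 9, 25, 50, 75, 91, 98)}

print("minimum length:", min(lengths))  # 32
print("seven-number summary:", summary)  # every value falls in 298-300
```

The single short read sets the minimum to 32, yet none of the seven percentiles moves, which is why the table and the plot warning can both be right at the same time.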
I do have a clarification question though. Is one read enough to make that error occur for the graph? In other words, am I seeing this error at each position because each position is (potentially) selecting a short read during the random sampling? If there are so few short sequences how/why are they taking over the random sampling plot?
Hopefully that makes sense, I’m struggling to articulate my confusion.
The warning on the plot is telling you that not all of the box plots are made up of the same number of reads, so you should be careful when comparing the distributions of two different box plots (nt positions).
It isn't an error, just a warning (see my comments above). To answer the question though, "yes" (for the reasons stated above).
Because it is important that you know that not all of the box plots are composed of the same number of reads.
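As a rough illustration of why a single short read is enough, here is a small Python sketch (again, not QIIME 2's implementation). It uses made-up read lengths, 1-based positions, and a hypothetical `reads_covering` helper that counts how many reads are long enough to contribute a quality score at a given position.

```python
def reads_covering(lengths, position):
    """Count reads with at least `position` bases (1-based position)."""
    return sum(1 for n in lengths if n >= position)


# Made-up data: one 32 nt read among 10,000 subsampled reverse reads.
lengths = [32] + [298] * 3333 + [299] * 3333 + [300] * 3333
shortest = min(lengths)

for position in (32, 33, 150, 298):
    covered = reads_covering(lengths, position)
    # Any position past the shortest read is built from fewer than all reads.
    note = "fewer than all reads" if position > shortest else "all reads"
    print(f"position {position}: {covered} of {len(lengths)} reads ({note})")
```

With this data, every position from 33 onward is based on 9,999 of the 10,000 reads rather than all of them, so each of those box plots would be flagged even though the short read barely changes what they look like.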
Hope that helps!
PS - the warning text above mentions this too --- "as a result, the plot at this position is not based on data from all the sequences..."