I am struggling to understand and interpret my demux summary data from an Illumina NextSeq run (picture below). Is it saying that there is no data from 0 to about 60 bases, and that I should therefore set my dada2 denoise-paired parameter --p-trim-left-f to about 60 or 70? How come I hardly see any boxes in the box-and-whisker plots? Is that a bad thing?
Not quite! It is saying that "of the sequences subsampled for this visualization, positions 0 to 60 had no variation in the quality score - they were all identical." It looks like after the first 6 or 7 nucleotides, the quality score is exactly 31 or 32 (you can confirm by hovering over a "box" there; the table below will update with the relevant distribution).
Hopefully what I wrote above helps explain that. As the viz mentions above, you can click and drag to zoom in on an area of interest, too.
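To make the "no boxes" point concrete, here is a minimal sketch (not the plugin's actual code) of why a box collapses to a flat line when every subsampled read has the same quality score at a position. The percentile set, the function names `box_summary` and `box_height`, and the simulated scores are all assumptions chosen for illustration.

```python
import numpy as np

# Illustrative percentiles for a box-and-whisker summary (an assumption,
# not taken from the demux summarize source code).
PERCENTILES = [9, 25, 50, 75, 91]


def box_summary(scores):
    """Return the percentile values a box plot would be drawn from."""
    values = np.percentile(np.asarray(scores, dtype=float), PERCENTILES)
    return dict(zip(PERCENTILES, values.tolist()))


def box_height(scores):
    """Height of the box: distance between the 25th and 75th percentiles."""
    summary = box_summary(scores)
    return summary[75] - summary[25]


# Every read has the same score at this position: the box has zero height.
identical = [32] * 10_000

# Simulated position with a spread of scores: the box is visible.
rng = np.random.default_rng(0)
varied = rng.integers(20, 39, size=10_000)

print(box_summary(identical))
print("identical box height:", box_height(identical))
print("varied box height:", box_height(varied))
```

When all of the percentiles are equal, the whiskers, box edges, and median are drawn on top of each other, so the position looks empty even though data is present.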
Not necessarily, although these profiles in general look suspicious to me - we normally see much more variation in quality scores in "unadulterated" Illumina reads. If I had to guess, I would assume that some form of quality filtering was applied prior to this step. If that is the case, then using a tool like DADA2 might not make the most sense for you, since that method works best with the "rawest" data, which it uses to generate its own error profile.
Loading required package: Rcpp
Self-consistency loop terminated before convergence.
Duplicate sequences in merged output.
Duplicate sequences detected and merged.
The sequences being tabled vary in length.
R version 3.4.1 (2017-06-30)
DADA2 R package version: 1.6.0
Write output
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Several questions come to mind based on the dada2 outputs:
How does dada2 handle duplicate sequences?
How do I handle sequences of varying lengths? I assume the sequences that are not merged are just dropped and there's nothing I can do about that?
Seeing as this analysis took about 7.5 days to complete for these 32 samples, would you recommend running these samples individually with dada2 denoise-paired and then combining all outputs together at the end? Or would that not be advised because all these samples were run on the same sequencing run?