Demux summary interpretation and dada2 denoise-paired parameters

ylor · August 20, 2018, 8:15pm

Hi all,

I am struggling to interpret my demux summary data from an Illumina NextSeq run (picture below). Is it saying that there is no data from position 0 to about 60, and that I should therefore set my dada2 denoise-paired parameter --p-trim-left-f to about 60 or 70? Why do I see hardly any boxes in the box-and-whisker plots? Is that a bad thing?

Thanks!

thermokarst · August 21, 2018, 1:02pm

Hey there @ylor!

Not quite! It is saying that "of the sequences subsampled for this visualization, positions 0 to 60 had no variation in the quality score - they were all identical." It looks like after the first 6 or 7 nts, the quality score is exactly 31 or 32 (you can confirm by hovering over a "box" there; the table below will update with the relevant distribution).

Hopefully what I wrote above helps explain that. As the viz mentions above, you can click and drag to zoom in on an area of interest, too.
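To make the point about missing boxes concrete, here is a minimal sketch of how per-position box plot statistics collapse when every sampled read has the same quality score at a position. It is not the code behind the demux summarize visualization; the function names, the Phred+33 offset, and the toy quality strings are assumptions for illustration only.

```python
import statistics

def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def position_summary(quality_lines):
    """Return (min, q1, median, q3, max) of the scores at each read position."""
    columns = zip(*(phred_scores(q) for q in quality_lines))
    summaries = []
    for scores in columns:
        q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        summaries.append((min(scores), q1, median, q3, max(scores)))
    return summaries

# Hypothetical toy reads: positions 0-2 vary, positions 3-5 are all 'A' (Q32).
toy_quality_lines = ["#5AAAA", "/AAAAA", "A<FAAA", "5/AAAA"]

for pos, (low, q1, med, q3, high) in enumerate(position_summary(toy_quality_lines)):
    # When min == max, the box and whiskers have zero height and nothing is drawn.
    shape = "flat (no visible box)" if low == high else "box visible"
    print(pos, low, q1, med, q3, high, shape)
```

A position where the minimum and maximum coincide still contains data; the box simply has no height, which is why it seems to disappear from the plot.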

Not necessarily, although these profiles in general look suspicious to me - we normally see much more variation in quality scores in "unadulterated" Illumina reads. If I had to guess, I would assume that some form of quality filtering was applied prior to this step. If that was the case, then using a tool like DADA2 might not make the most sense for you, since that method works best with the "rawest" data, to generate its own error profile.

Hope that helps! :qiime2:

ylor · August 22, 2018, 1:07pm

Thank you for your quick response @thermokarst.

The only thing done prior to importing was demultiplexing using bcl2fastq.

I tried running dada2 denoise-paired on the data anyway and got the following:

qiime dada2 denoise-paired --verbose \
--i-demultiplexed-seqs ./demux-paired-end.qza \
--p-trunc-len-f 145 \
--p-trunc-len-r 145 \
--p-trim-left-f 10 \
--p-trim-left-r 10 \
--o-denoising-stats ./dada2-stats.qza \
--o-representative-sequences ./rep-seqs-dada2-paired.qza \
--o-table ./table-dada2-paired.qza \
--p-n-threads 38

Output:

Loading required package: Rcpp
Self-consistency loop terminated before convergence.
Duplicate sequences in merged output.
Duplicate sequences detected and merged.
The sequences being tabled vary in length.

R version 3.4.1 (2017-06-30)
DADA2 R package version: 1.6.0

Filtering ................................

Learning Error Rates
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 2864212 reads in 836316 unique sequences.
selfConsist step 2
selfConsist step 3
selfConsist step 4
selfConsist step 5
selfConsist step 6
selfConsist step 7
selfConsist step 8
selfConsist step 9
selfConsist step 10
2b) Reverse Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 2864212 reads in 857319 unique sequences.
selfConsist step 2
selfConsist step 3
selfConsist step 4
selfConsist step 5
selfConsist step 6
selfConsist step 7
selfConsist step 8
selfConsist step 9
selfConsist step 10

Denoise remaining samples ...............................

Remove chimeras (method = consensus)

Write output
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /var/lib/condor/execute/dir_931625/tmps9r8uxvi/forward /var/lib/condor/execute/dir_931625/tmps9r8uxvi/reverse /var/lib/condor/execute/dir_931625/tmps9r8uxvi/output.tsv.biom /var/lib/condor/execute/dir_931625/tmps9r8uxvi/track.tsv /var/lib/condor/execute/dir_931625/tmps9r8uxvi/filt_f /var/lib/condor/execute/dir_931625/tmps9r8uxvi/filt_r 145 145 10 10 2.0 2 consensus 1.0 38 1000000
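The trailing values in that command line can be read back against the options passed to qiime. The sketch below labels them by position. The first four labels and n_threads match values given on the command line in this thread, and "consensus" matches the chimera method printed in the log; the remaining labels (max_ee, trunc_q, min_fold_parent_over_abundance, n_reads_learn) are assumptions based on the plugin's default parameters, not something confirmed by the output here. The temporary directory paths are shortened to placeholders.

```python
import shlex

# Labels for the ten trailing positional arguments.
# ASSUMPTION: the order below is inferred from the values shown in the thread
# and the plugin defaults; it is not read from run_dada_paired.R itself.
ASSUMED_LABELS = [
    "trunc_len_f", "trunc_len_r", "trim_left_f", "trim_left_r",
    "max_ee", "trunc_q", "chimera_method",
    "min_fold_parent_over_abundance", "n_threads", "n_reads_learn",
]

def label_trailing_args(command, labels=ASSUMED_LABELS):
    """Pair the last len(labels) tokens of a command string with the labels."""
    tokens = shlex.split(command)
    return dict(zip(labels, tokens[-len(labels):]))

# Paths replaced with short placeholders; the numeric tail is copied from the log.
command = (
    "run_dada_paired.R forward reverse output.tsv.biom track.tsv filt_f filt_r "
    "145 145 10 10 2.0 2 consensus 1.0 38 1000000"
)

for name, value in label_trailing_args(command).items():
    print(f"{name} = {value}")
```

Reading the values this way is a quick check that the truncation, trimming, and thread settings that reached the R script are the ones that were intended.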

Several questions come to mind based on the dada2 outputs:

How does dada2 handle duplicate sequences?
How do I handle sequences of varying lengths? I assume the sequences that are not merged are just dropped and there's nothing I can do about that?
Seeing as this analysis took about 7.5 days to complete for these 32 samples, would you recommend running these samples individually with dada2 denoise-paired and then combining all outputs at the end? Or would that not be advised, because all of these samples were run on the same sequencing run?
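As a rough way to reason about the varying-length question, the sketch below works out how the truncation and trim settings used above translate into read overlap and merged length. The amplicon lengths are hypothetical (the thread does not state the target region), the geometry assumes each amplicon is at least as long as the truncation length, and min_overlap=12 is the default of mergePairs in the DADA2 R package; whether the QIIME 2 wrapper in this version uses the same threshold is an assumption.

```python
def merge_geometry(amplicon_len, trunc_len_f, trunc_len_r, trim_left_f, trim_left_r):
    """Return (overlap, merged_len) for one read pair spanning an amplicon.

    The forward read covers amplicon positions [trim_left_f, trunc_len_f) and
    the reverse read covers the mirror-image range from the other end.
    A negative overlap means the two reads leave a gap and cannot be merged.
    """
    overlap = trunc_len_f + trunc_len_r - amplicon_len
    merged_len = amplicon_len - trim_left_f - trim_left_r
    return overlap, merged_len

def can_merge(overlap, min_overlap=12):
    """ASSUMPTION: 12 nt is the merge threshold; adjust to match your version."""
    return overlap >= min_overlap

# Settings from the command in this thread; amplicon lengths are made up.
for amplicon_len in (250, 253, 260, 300):
    overlap, merged_len = merge_geometry(amplicon_len, 145, 145, 10, 10)
    status = f"merges to {merged_len} nt" if can_merge(overlap) else "fails to merge"
    print(f"amplicon {amplicon_len} nt: overlap {overlap} nt, {status}")
```

Under these assumptions, merged sequences differ in length whenever the underlying amplicons do, which would be consistent with the "sequences being tabled vary in length" message, and pairs whose truncated reads no longer overlap enough are the ones that would fail to merge.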

Thanks!

thermokarst · August 23, 2018, 2:25pm

Hmm, I would bet my bottom dollar that these reads were modified prior to generating the demux summarize viz. They aren't from Mr. DNA, are they?

Have you had a chance to review any of the DADA2 documentation? I believe all of these questions are covered there. Thanks! :qiime2:

ylor · August 23, 2018, 4:21pm

These data were generated with our in-house NextSeq machine. I took the raw data and ran it using the following bcl2fastq:

nohup /usr/local/bin/bcl2fastq \
--runfolder-dir /illumina/runs/Runs/ \
--output-dir /illumina/runs/Runs/bcl2fastq_output_no_lane_splitting \
--interop-dir /illumina/runs/Runs/ \
--sample-sheet /illumina/runs/Runs/SampleSheet.csv \
--no-lane-splitting &

From there, I renamed the files so that they were all casava1.8 format compatible... Would it have mattered if I did

I'll have to re-read the DADA2 documentation more carefully. Thanks!
