DADA2 denoise-paired Filtering Error

Hi All,

I’m running DADA2 from QIIME 2 2018.2 with this command:

nohup qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paried_alldata_paired-end-demux.qza \
    --o-table table-dada2_paired_alldata.qza \
    --o-representative-sequences rep-seqs-dada2_paired_alldata.qza \
    --p-trunc-len-f 240 \
    --p-trunc-len-r 240 \
    --verbose \
    --p-n-threads 0 > dada2.log &

but I always get this error:

R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0
1) Filtering Error in filterAndTrim(unfiltsF, filtsF, unfiltsR, filtsR, truncLen = c(truncLenF,  :
  These are the errors (up to 5) encountered in individual cores...
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 343, 15538.
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 343, 15538.
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 343, 15538.
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 343, 15538.
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 343, 15538.
Execution halted
Plugin error from dada2:

  An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmp3dwh4cjv/forward /tmp/tmp3dwh4cjv/reverse /tmp/tmp3dwh4cjv/output.tsv.biom /tmp/tmp3dwh4cjv/filt_f /tmp/tmp3dwh4cjv/filt_r 240 240 0 0 2.0 2 consensus 1.0 0 1000000

Can anyone help with this error? I ran DADA2 on a smaller dataset and did not get this error, but that run used trimming parameters, which I avoided here.

Regards,

Hey there @Faisal!

It looks like one of your samples has way more reverse reads than forward reads. The count of reads per direction should be identical.
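One quick way to check this outside of QIIME 2 is to count the records in each direction's fastq.gz: a fastq record is 4 lines, so the read count is the line count divided by 4. A minimal sketch using tiny synthetic demo files (the filenames are hypothetical):

```shell
# Build two tiny demo fastq.gz files (synthetic data, hypothetical names)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' | gzip > demo_R1.fastq.gz
printf '@r1\nTGCA\n+\nIIII\n' | gzip > demo_R2.fastq.gz
# A fastq record is 4 lines, so reads = line count / 4
fwd=$(( $(zcat demo_R1.fastq.gz | wc -l) / 4 ))
rev=$(( $(zcat demo_R2.fastq.gz | wc -l) / 4 ))
echo "forward=$fwd reverse=$rev"
```

Running the same check on the real R1/R2 pair for each sample will reveal any pair where the counts differ (like the 343 vs 15538 in the error above).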

How did you import these data? Did you demultiplex them yourself? Some more info here will help us figure out where to go next! Thanks! :qiime2: :t_rex:

Hi @thermokarst

The data were received from the sequencing facility already demultiplexed. I imported them all via a manifest with this command:

nohup qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path paired_w16_w28_manifest.csv \
  --output-path paried_w16_w28_paired-end-demux.qza \
  --source-format PairedEndFastqManifestPhred33 &

Around half of this dataset was sequenced at a different time.
A summary of the imported data is attached.

Thanks
demux_summary.qzv (293.8 KB)


Can you share your manifest file? I suspect there might be a mismatch there - a forward might be mapped to the wrong reverse. Thanks!
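For reference, a `PairedEndFastqManifestPhred33` manifest in that era of QIIME 2 is a CSV with one row per file per direction; the paths below are hypothetical:

```
sample-id,absolute-filepath,direction
sample-1,/data/sample-1_R1.fastq.gz,forward
sample-1,/data/sample-1_R2.fastq.gz,reverse
sample-2,/data/sample-2_R1.fastq.gz,forward
sample-2,/data/sample-2_R2.fastq.gz,reverse
```

A mistake like pairing sample-1's forward file with sample-2's reverse file would produce exactly the kind of mismatch error shown above.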

Attached, Thanks
paired_import_manifest.csv (65.4 KB)


Hi all,
Does anyone have an idea how to solve this issue?

By the way, I tried running about half of this dataset and DADA2 went fine :thinking:. Could the dataset size be the reason for this issue?

Thanks,

Hi,

I noticed that 343 is the sequence count of one of the samples, the lowest in the dataset. I checked its forward and reverse files and noticed quite a big difference in file size: the forward file was around 44 KB but the reverse was around 2.5 MB. I removed both, made a new manifest, and imported the data into a new artifact.
With DADA2, the run goes through, but this message appears:

R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0
1) Filtering .......................................................................................................................................................................................................................
2) Learning Error Rates
Not all sequences were the same length.
Not all sequences were the same length.
Not all sequences were the same length.
.
.
.
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 90919 reads in 38359 unique sequences.
Sample 2 - 80439 reads in 31582 unique sequences.
Sample 3 - 72901 reads in 33815 unique sequences.
Sample 4 - 79804 reads in 39592 unique sequences.
Sample 5 - 80884 reads in 36677 unique sequences.
Sample 6 - 72550 reads in 36566 unique sequences.
Sample 7 - 86889 reads in 37105 unique sequences.
Sample 8 - 75139 reads in 32848 unique sequences.
Sample 9 - 56918 reads in 25322 unique sequences.
Sample 10 - 61926 reads in 26031 unique sequences.
Sample 11 - 42135 reads in 15584 unique sequences.
Sample 12 - 59525 reads in 21987 unique sequences.
Sample 13 - 806 reads in 704 unique sequences.
Sample 14 - 68152 reads in 28985 unique sequences.
Sample 15 - 69211 reads in 27178 unique sequences.
Sample 16 - 89791 reads in 37242 unique sequences.
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
   selfConsist step 6
   selfConsist step 7
   selfConsist step 8
Convergence after  8  rounds.
2b) Reverse Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 90919 reads in 51254 unique sequences.
Sample 2 - 80439 reads in 42135 unique sequences.
Sample 3 - 72901 reads in 43517 unique sequences.
Sample 4 - 79804 reads in 49069 unique sequences.
Sample 5 - 80884 reads in 49603 unique sequences.
Sample 6 - 72550 reads in 45239 unique sequences.
Sample 7 - 86889 reads in 48509 unique sequences.
Sample 8 - 75139 reads in 42577 unique sequences.
Sample 9 - 56918 reads in 31360 unique sequences.
Sample 10 - 61926 reads in 34169 unique sequences.
Sample 11 - 42135 reads in 22626 unique sequences.
Sample 12 - 59525 reads in 29951 unique sequences.
Sample 13 - 806 reads in 747 unique sequences.
Sample 14 - 68152 reads in 38565 unique sequences.
Sample 15 - 69211 reads in 36730 unique sequences.
Sample 16 - 89791 reads in 47386 unique sequences.
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
   selfConsist step 6
   selfConsist step 7
Convergence after  7  rounds.

3) Denoise remaining samples Not all sequences were the same length.
Not all sequences were the same length.
.Not all sequences were the same length.
.
.

This run is still going, but at least it does not stop at the first stage as before.
I split the two sequencing batches into two different runs.
This issue only happens with the first sequencing batch; the more recent batch runs without showing the error “Not all sequences were the same length”.

I should note that I received the fastq files for these sequences unzipped, and I gzipped each file myself before importing them as an artifact. Is there a possibility of file damage? I know that should not happen, but I’m wondering about the cause here. Both sequencing batches were done at the same facility using the same protocol and primers.

Any ideas could help?

Thanks,

This is the DADA2 log for the most recent batch of sequenced samples; it does not show the error seen with the older batch.

R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0
1) Filtering .....................................................................................................
..................................................................................................................
.
2) Learning Error Rates
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 39194 reads in 18072 unique sequences.
Sample 2 - 44820 reads in 14200 unique sequences.
Sample 3 - 95965 reads in 34938 unique sequences.
Sample 4 - 49969 reads in 20767 unique sequences.
Sample 5 - 63438 reads in 19303 unique sequences.
Sample 6 - 62350 reads in 18530 unique sequences.
Sample 7 - 49340 reads in 18911 unique sequences.
Sample 8 - 60695 reads in 22343 unique sequences.
Sample 9 - 77311 reads in 22536 unique sequences.
Sample 10 - 64340 reads in 25432 unique sequences.
Sample 11 - 44313 reads in 16861 unique sequences.
Sample 12 - 59871 reads in 22531 unique sequences.
Sample 13 - 73882 reads in 32930 unique sequences.
Sample 14 - 59350 reads in 23047 unique sequences.
Sample 15 - 23171 reads in 8916 unique sequences.
Sample 16 - 70327 reads in 29284 unique sequences.
Sample 17 - 59912 reads in 20902 unique sequences.
Sample 18 - 31603 reads in 13802 unique sequences.
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
   selfConsist step 6
   selfConsist step 7
   selfConsist step 8
   selfConsist step 9
   selfConsist step 10
Self-consistency loop terminated before convergence.
2b) Reverse Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 39194 reads in 22616 unique sequences.
Sample 2 - 44820 reads in 20319 unique sequences.
Sample 3 - 95965 reads in 49275 unique sequences.
Sample 4 - 49969 reads in 26094 unique sequences.
Sample 5 - 63438 reads in 29221 unique sequences.
Sample 6 - 62350 reads in 29065 unique sequences.
Sample 7 - 49340 reads in 26150 unique sequences.
Sample 8 - 60695 reads in 31435 unique sequences.
Sample 9 - 77311 reads in 30663 unique sequences.
Sample 10 - 64340 reads in 33927 unique sequences.
Sample 11 - 44313 reads in 22315 unique sequences.
Sample 12 - 59871 reads in 30972 unique sequences.
Sample 13 - 73882 reads in 40548 unique sequences.
Sample 14 - 59350 reads in 30319 unique sequences.
Sample 15 - 23171 reads in 12529 unique sequences.
Sample 16 - 70327 reads in 37462 unique sequences.
Sample 17 - 59912 reads in 28615 unique sequences.
Sample 18 - 31603 reads in 18769 unique sequences.
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
   selfConsist step 6
   selfConsist step 7
Convergence after  7  rounds.

3) Denoise remaining samples ......................................................................................................................................................................................................
The sequences being tabled vary in length.
4) Remove chimeras (method = consensus)
                                 input filtered denoised merged non-chimeric
A001w28_0_L001_R1_001.fastq.gz   73992    39194    39194    389          285
A002w28_2_L001_R1_001.fastq.gz   87635    44820    44820     88           87
A003w28_4_L001_R1_001.fastq.gz  187455    95965    95965    243          235
A004w28_6_L001_R1_001.fastq.gz   89517    49969    49969     18           16
A005w28_8_L001_R1_001.fastq.gz  115993    63438    63438     44           44
A006w28_10_L001_R1_001.fastq.gz 129567    62350    62350     28           28
6) Write output
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmpa1f97v01/forward /tmp/tmpa1f97v01/reverse /tmp/tmpa1f97v01/output.tsv.biom /tmp/tmpa1f97v01/filt_f /tmp/tmpa1f97v01/filt_r 222 222 0 0 2.0 2 consensus 1.0 0 1000000

Saved FeatureTable[Frequency] to: table-dada2_paired_w28.qza
Saved FeatureData[Sequence] to: rep-seqs-dada2_paired_w28.qza

Is this related to a specific issue with how these samples were sequenced at the facility?

Hi there @Faisal - no, the size of the dataset is not the problem.

in case you missed it above, this is the problem:

The next step would be to look at the manifest, to make sure you didn't accidentally map the wrong files to a sample (e.g. sample_a forward and sample_b reverse both mapped to sample_a). I skimmed the file and nothing jumped out at me.

After that, I would check the file sizes:

Okay! Now we are onto something!

Perfect.

That is fine - these are two completely different messages. The first error had to do with the number of forward and reverse reads, while the second has to do with the length of the reads themselves (how many nt long).

Good - I noticed your manifest looked like it might have been assembled from multiple runs. DADA2 should operate on one run at a time; merge the artifacts afterwards (see the FMT Tutorial for an example).
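A sketch of that per-run-then-merge workflow (artifact names are hypothetical, and the merge parameter names changed between QIIME 2 releases, so check `qiime feature-table merge --help` in your version):

```shell
# Denoise each sequencing run separately...
qiime dada2 denoise-paired \
    --i-demultiplexed-seqs run1-demux.qza \
    --p-trunc-len-f 240 --p-trunc-len-r 240 \
    --o-table table-run1.qza \
    --o-representative-sequences rep-seqs-run1.qza
# (repeat for run2-demux.qza)

# ...then merge the per-run feature tables and representative sequences
qiime feature-table merge \
    --i-table1 table-run1.qza \
    --i-table2 table-run2.qza \
    --o-merged-table table-merged.qza
qiime feature-table merge-seq-data \
    --i-data1 rep-seqs-run1.qza \
    --i-data2 rep-seqs-run2.qza \
    --o-merged-data rep-seqs-merged.qza
```

Denoising per run matters because DADA2's error model is fit per sequencing run; mixing runs blurs the error profiles it learns.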

My guess is that there is no damage (but it's possible). More likely there was a renaming problem, especially if you renamed the files manually. Please note: the fastq manifest does not require fastq.gz files - you can provide filepaths to plain fastq files and QIIME 2 will gzip them on import.
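If you want to rule out compression damage from the manual gzip step, `gzip -t` verifies an archive's integrity without writing anything to disk. A minimal sketch using a synthetic demo file:

```shell
# Make a tiny synthetic .gz file, then verify it with gzip's integrity test
printf '@r1\nACGT\n+\nIIII\n' | gzip > demo.fastq.gz
if gzip -t demo.fastq.gz 2>/dev/null; then status=ok; else status=corrupt; fi
echo "demo.fastq.gz: $status"
```

Looping this over every file in the manifest would quickly flag any archive that was truncated or damaged during the manual compression step.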

I don't think this is a batch-effect problem - it is most likely a clerical issue related to renaming. You could also re-download those files from the sequencing center's server (or however you acquired them originally) and double-check whether that sample's forward and reverse reads are the same length.

Sounds like you are all set - just let DADA2 keep on cooking and let us know how it goes! :t_rex: :qiime2:


Hi @thermokarst ,

Thanks for the reply,

The DADA2 run finished, but the result is unusual to me. After importing all the data, the total sequence count was 18,278,025, but after the DADA2 run it was 1,159,115, meaning around 94% of the reads were removed! For this run I did not use any trimming or truncation parameters. This is the command line used:

nohup qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paired-end-demux.qza \
    --o-table table-dada2.qza \
    --o-representative-sequences rep-seqs-dada2.qza \
    --p-trunc-len-f 0 \
    --p-trunc-len-r 0 \
    --verbose \
    --p-n-threads 0 > dada2-no-turnc.log &

I completed the taxonomic analysis, trained the classifier, and ran some analyses against my metadata, such as correlations. But the number of samples with positive correlations is very low: out of 210 samples I get around 10-15. This is very unusual for me, and my QIIME 1 results were far better. Is there a critically wrong parameter I used with DADA2 that I haven't noticed?

Thanks for your patience and help.

Based on your earlier dada2 log, the issue is that your forward and reverse reads are not overlapping sufficiently, causing a large number of reads to be dropped:

See how the number of reads decreases dramatically at the "merge" step?

A few questions:

  1. what primers are you using?
  2. what is the expected amplicon length?
  3. what is the length of your forward and reverse reads?

You are not truncating your reads, but that may not be enough: you need a minimum of ~20 nt of overlap between forward and reverse reads for merging to succeed.
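As a rough sanity check, the achievable overlap is approximately trunc-len-f + trunc-len-r minus the amplicon length. Plugging in the figures reported in this thread (2×300 nt MiSeq reads, ~500 bp V6-V8 amplicon; both are the poster's numbers, not measured values):

```shell
amplicon=500                       # expected V6-V8 amplicon length (bp), per the poster
full=$(( 300 + 300 - amplicon ))   # untruncated 2x300 reads
trunc=$(( 240 + 240 - amplicon ))  # the earlier 240/240 truncated run
echo "full-length overlap: ${full} nt; 240/240 overlap: ${trunc} nt"
```

So full-length reads should overlap by roughly 100 nt, comfortably above the ~20 nt minimum, while truncating both directions to 240 leaves no overlap at all. Note that low-quality read tails can still prevent merging even when the arithmetic works out, because DADA2 discards reads it cannot confidently merge.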

If your reads are not long enough, your only choice may be to proceed only with the forward or reverse reads as single-end data.
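If it comes to that, the single-end fallback would look roughly like this (filenames and truncation length are hypothetical; if your release will not accept a paired-end artifact here, re-import just the forward files as `SampleData[SequencesWithQuality]` first):

```shell
qiime dada2 denoise-single \
    --i-demultiplexed-seqs forward-only-demux.qza \
    --p-trim-left 0 \
    --p-trunc-len 240 \
    --o-table table-single.qza \
    --o-representative-sequences rep-seqs-single.qza
```

You lose the reverse reads' information, but you avoid the merge step entirely, so read retention is typically much higher.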

With so many reads being filtered out, you should not proceed or attempt to interpret any downstream results until this is resolved.

I hope that helps!

Hi @Nicholas_Bokulich,

I see the issue reported in the DADA2 log, and I'm puzzled by it, as I had no issue joining forward and reverse reads with QIIME 1.9.1.

  1. what primers are you using?

926F/1392R, targeting variable regions V6-V8. An Illumina MiSeq system was used for sequencing.

  2. what is the expected amplicon length?

500 bp

  3. what is the length of your forward and reverse reads?

300 nt. However, when I checked some sample files in both directions, I noticed that for some reads the forward is 301 nt while the reverse is 300 nt. I saw this difference in a few of the reads I checked.

If your reads are not long enough, your only choice may be to proceed only with the forward or reverse reads as single-end data.

I am considering this option; the forward reads are the better, higher-quality direction. But by using only the forward reads, I might lose a lot of useful information from my sequences.

Thanks,

An off-topic reply has been split into a new topic: Dada2 mismatch error after merging 2 runs

Please keep replies on-topic in the future.

Not to worry, we should be able to do the same here. I think I might see the problem:

If reads are different lengths after denoising, that might be causing issues with DADA2. Why are you setting the truncation lengths to zero? Try setting reasonable truncation lengths based on your quality profiles and see if you still get reads dropping out.
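To pick those truncation lengths, the interactive quality plot from `demux summarize` is the usual starting point; a sketch with hypothetical filenames:

```shell
qiime demux summarize \
    --i-data paired-end-demux.qza \
    --o-visualization demux-summary.qzv
# Open demux-summary.qzv (e.g. at view.qiime2.org) and choose truncation
# lengths where the median quality starts to fall, while keeping enough
# combined length for the forward and reverse reads to overlap.
```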