DADA2 filters the last 5 of every 30 samples to 0 reads, in a repeating pattern

Hi there!
Thanks for the awesome software. :slight_smile:

We are facing a problem with DADA2 when processing batches.

In our last experiment we denoised 180 samples in one run.
Upstream processing worked well on those samples, and they passed our QC tests.
The steps following DADA2 also work as expected.

However, if we divide these 180 samples into groups of 30, the first 25 samples in each group are processed and filtered correctly, while the last 5 samples in the group are filtered to 0 reads.
Samples 31-55 are correct, 56-60 are filtered to 0; 61-85 are OK, 86-90 are zeros; and so on…

In batch:

| sample-id | input | filtered | percentage of input passed filter | denoised | merged | percentage of input merged | non-chimeric | percentage of input non-chimeric |
|---|---|---|---|---|---|---|---|---|
| #q2:types | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric |
| Ban.R1 | 34415 | 25093 | 72.91 | 24950 | 24507 | 71.21 | 24489 | 71.16 |
| Ban.R2 | 46916 | 35372 | 75.39 | 35150 | 34673 | 73.9 | 34299 | 73.11 |
| Ban.R3 | 68330 | 52604 | 76.99 | 52428 | 51988 | 76.08 | 50758 | 74.28 |
| Ban.R4 | 26027 | 17160 | 65.93 | 16944 | 16279 | 62.55 | 15848 | 60.89 |
| Ban.R5 | 30320 | 23592 | 77.81 | 23427 | 23152 | 76.36 | 23065 | 76.07 |
| Ban.R6 | 42462 | 30830 | 72.61 | 30579 | 30060 | 70.79 | 29651 | 69.83 |
| Ban.R7 | 45445 | 29816 | 65.61 | 29534 | 28403 | 62.5 | 27976 | 61.56 |
| Ban.R8 | 77493 | 47258 | 60.98 | 47033 | 37531 | 48.43 | 34393 | 44.38 |
| Ban.R9 | 63696 | 47433 | 74.47 | 47258 | 45982 | 72.19 | 45629 | 71.64 |
| Ban.R10 | 18971 | 12111 | 63.84 | 11876 | 9797 | 51.64 | 9787 | 51.59 |
| Ban.R11 | 127810 | 80674 | 63.12 | 80496 | 79147 | 61.93 | 74780 | 58.51 |
| Ban.R12 | 20333 | 13394 | 65.87 | 13125 | 12524 | 61.59 | 12495 | 61.45 |
| Ban.R13 | 23739 | 15662 | 65.98 | 15388 | 14093 | 59.37 | 13733 | 57.85 |
| Ban.R14 | 66661 | 44563 | 66.85 | 44378 | 41694 | 62.55 | 39403 | 59.11 |
| Ban.R15 | 57505 | 41538 | 72.23 | 41330 | 38517 | 66.98 | 33662 | 58.54 |
| Ban.R16 | 36302 | 27728 | 76.38 | 27532 | 26985 | 74.33 | 26703 | 73.56 |
| Ban.R17 | 79508 | 60794 | 76.46 | 60512 | 59712 | 75.1 | 53463 | 67.24 |
| Ban.R18 | 15382 | 10091 | 65.6 | 9867 | 9417 | 61.22 | 9412 | 61.19 |
| Ban.R19 | 16431 | 12351 | 75.17 | 12115 | 11763 | 71.59 | 11743 | 71.47 |
| Ban.R20 | 25313 | 17781 | 70.24 | 17604 | 17197 | 67.94 | 17193 | 67.92 |
| Ban.R21 | 72790 | 46279 | 63.58 | 46056 | 45713 | 62.8 | 45192 | 62.09 |
| Ban.R22 | 20664 | 14136 | 68.41 | 13842 | 12806 | 61.97 | 12805 | 61.97 |
| Ban.R23 | 43694 | 30971 | 70.88 | 30845 | 30674 | 70.2 | 30425 | 69.63 |
| Ban.R24 | 3565 | 2315 | 64.94 | 2208 | 1948 | 54.64 | 1939 | 54.39 |
| Ban.R25 | 70223 | 47477 | 67.61 | 47193 | 44187 | 62.92 | 42290 | 60.22 |
| Ban.R26 | 48225 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ban.R27 | 50671 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ban.R28 | 50430 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ban.R29 | 20805 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ban.R30 | 32571 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
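
With 180 samples, spotting the zeroed rows by eye is tedious. Below is a minimal sketch that lists them, assuming the denoising stats have been exported to a tab-separated file (for example via `qiime tools export`) whose column names match the table above. The function name and the file path are hypothetical.

```python
import csv


def samples_filtered_to_zero(stats_tsv_path):
    """List sample IDs that had input reads but none left after filtering."""
    zeroed = []
    with open(stats_tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Skip the QIIME 2 metadata directive row (#q2:types ...).
            if row["sample-id"].startswith("#q2:"):
                continue
            # float() tolerates both "25093" and "25093.0" style values.
            if float(row["input"]) > 0 and float(row["filtered"]) == 0:
                zeroed.append(row["sample-id"])
    return zeroed
```

Calling `samples_filtered_to_zero("stats.tsv")` on the exported file (path assumed) returns the affected sample IDs in file order, which makes a repeating pattern across groups easier to check.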

So I tested whether the data itself is correct by importing it as a separate analysis. The same sample that was filtered to 0 reads in the batch is filtered as expected, to about 70% of input reads, when processed alone.

Separate analysis:

| sample-id | input | filtered | percentage of input passed filter | denoised | merged | percentage of input merged | non-chimeric | percentage of input non-chimeric |
|---|---|---|---|---|---|---|---|---|
| #q2:types | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric |
| Ban.R27 | 50671 | 36496 | 72.03 | 36290 | 35948 | 70.94 | 32277 | 63.7 |

We are using qiime2-2021.2 installed via conda (4.10.1). We checked whether the problem persists in qiime2-2021.4, and it is still present. The system is Ubuntu 20.04.2 LTS.

Could you please advise? A quick solution is to process the samples separately, but that is not sustainable.

qiime dada2 denoise-paired \
  --verbose \
  --i-demultiplexed-seqs tmp/qiime2/reads.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 270 \
  --p-n-threads 0 \
  --p-no-hashed-feature-ids \
  --o-table tmp/qiime2/table-dada2.qza \
  --o-representative-sequences tmp/qiime2/rep-seqs-dada2.qza \
  --o-denoising-stats tmp/qiime2/dada2-stats.qza

From the verbose output, it seems the problem is that DADA2: 1.18.0 / Rcpp: 1.0.6 / RcppParallel: 5.0.2 is not writing the filtered reads for those samples.

verbose output:
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmpahga032z/forward /tmp/tmpahga032z/reverse /tmp/tmpahga032z/output.tsv.biom /tmp/tmpahga032z/track.tsv /tmp/tmpahga032z/filt_f /tmp/tmpahga032z/filt_r 0 270 0 0 2.0 2.0 2 independent consensus 1.0 0 1000000

R version 4.0.2 (2020-06-22) 
Loading required package: Rcpp
DADA2: 1.18.0 / Rcpp: 1.0.6 / RcppParallel: 5.0.2 
1) Filtering The filter removed all reads: /tmp/tmpahga032z/filt_f/C69_178_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C69_358_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R119_118_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R119_298_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C67_176_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C67_356_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C66_175_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C66_355_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R117_116_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R117_296_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R116_115_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R116_295_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C68_177_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C68_357_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C70_179_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C70_359_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R118_117_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R118_297_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R27_26_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R27_206_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R120_119_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R120_299_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R28_27_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R28_207_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R149_148_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R149_328_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R150_149_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R150_329_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R146_145_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R146_325_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R26_25_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R26_205_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R148_147_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R148_327_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R147_146_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R147_326_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R29_28_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R29_208_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R30_29_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R30_209_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R57_56_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R57_236_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R56_55_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R56_235_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R59_58_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R59_238_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R90_89_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R90_269_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R58_57_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R58_237_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R60_59_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R60_239_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R88_87_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R88_267_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R86_85_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R86_265_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R87_86_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R87_266_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R89_88_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R89_268_L001_R2_001.fastq.gz not written.
Some input samples had no reads pass the filter.
.........................xxxxx...................xxxx.x...........................xxxx.x...........xxxx.x...........................xxxx.x...........................xxxx.x.........

Hi there @valzip!

The short answer is that the training model is based on the first “n” reads in your dataset - the default is 1,000,000. If you change the ordering of the samples, your model will be trained with slightly different data, which is probably what you’re observing here. A related post:

I want to make sure you’re aware that DADA2 expects to be executed on a per-sequencing-run basis. It’s okay to subdivide a run into multiple execution batches, but combining sequencing runs should be avoided!

I hope that helps.

:qiime2:

Hi Matthew @thermokarst, thanks for the answer! :slightly_smiling_face:

I would agree with this if it were just a subtle change in the numbers, e.g. if 14136 of 20664 reads passed the filter instead of 15001. Here we are missing all reads in several samples.

Today I tested our workflow on a different machine, and the “error” persisted on that system too. However, I made a mistake and copied a step with a lower --p-trunc-len-r value, and all samples were processed correctly.

I found out that some reads were shorter than the declared “--p-trunc-len-r 270”. The filtering step in the R part of the QIIME 2 analysis discards reads shorter than the truncation length, and in the affected samples that removed every read.
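
To check this before denoising, one can tabulate the read lengths in a demultiplexed FASTQ file and count how many reads would survive a given truncation length. This is a minimal sketch under stated assumptions: it only models the length rule (reads shorter than the truncation length are dropped, and 0 disables truncation), not quality or expected-error filtering, and both function names are hypothetical.

```python
import gzip
from collections import Counter


def read_lengths(fastq_path):
    """Return a Counter mapping read length -> number of reads in a FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lengths = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a 4-line FASTQ record, the sequence is the second line.
            if line_number % 4 == 1:
                lengths[len(line.strip())] += 1
    return lengths


def reads_surviving_truncation(lengths, trunc_len):
    """Count reads at least trunc_len long; shorter reads would be discarded."""
    if trunc_len == 0:
        # 0 means "do not truncate", so no read is dropped for its length.
        return sum(lengths.values())
    return sum(count for length, count in lengths.items() if length >= trunc_len)
```

For example, running `read_lengths` on one of the reverse-read files of an affected sample and then calling `reads_surviving_truncation` with 269 and 270 should show whether every read in that file falls just below the chosen truncation length.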

I fiddled with this setting and found out that we were just one nucleotide off.
--p-trunc-len-r #

259
....................................................................................................................................................................................
264
....................................................................................................................................................................................
269
....................................................................................................................................................................................
270
.........................xxxxx...................xxxx.x...........................xxxx.x...........xxxx.x...........................xxxx.x...........................xxxx.x.........
271
.........................xxxxx...................xxxx.x...........................xxxx.x...........xxxx.x...........................xxxx.x...........................xxxx.x.........
272
..........xxxxx.....xxxxxxxxxx...xxxxx....x.xxxxxxxxxxx.........x.xxxxx....x.xxxxxxxxxxx......xxxxxxxxx.x...........xxxxx......xxxxxxxxx.x...........xxxxx......xxxxxxxxx.x.........
273
.....xxxxxxxxxx.....xxxxxxxxxx.xxxxxxx....x.xxxxxxxxxxx.....xxxxxxxxxxx....x.xxxxxxxxxxx......xxxxxxxxx.x.....xxxx.xxxxxx......xxxxxxxxxxx.....xxxxxxxxxxx....x.xxxxxxxxxxx.....xxxx
289
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
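
The progress strings above are long enough that counting positions by hand is error-prone. Here is a small sketch that converts one into runs of failed positions, assuming each character stands for one sample and `x` marks a sample whose reads were all removed (which is consistent with the stats table, but is an assumption about the script's output). The function name is hypothetical, and positions follow the order in which files were processed, which may differ from sample-ID order.

```python
from itertools import groupby


def failed_runs(progress):
    """Return (start, end) 1-based inclusive positions of each run of 'x'."""
    runs = []
    position = 1
    for char, group in groupby(progress.strip()):
        length = len(list(group))
        if char == "x":
            runs.append((position, position + length - 1))
        position += length
    return runs
```

On a toy string such as `"..xx.x"` this yields `[(3, 4), (6, 6)]`. Applying it to the output for each `--p-trunc-len-r` value makes it easier to compare which positions start failing as the value increases.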

I am a bit surprised that this action did not trigger any error message.

That being said, there is no error on the QIIME 2 side. There is some preferential trimming upstream by either je or fastp. Thank you for your time. :+1:

Yes, we are aware of this, but sometimes we have huge experiments (150 samples in a single sequencing run, as we use dual barcoding). Thx