DADA2 filtered out 5 of every 30 samples in a repeating pattern

valzip · May 21, 2021, 3:47pm

Hi there!
Thanks for the awesome software.

We are facing a problem with DADA2 when processing batches.

In our last experiment we denoised 180 samples in one run.
Upstream processing worked well on those samples and passed our QC tests.
Steps following dada2 are also working as expected.

However, if we divide these 180 samples into groups of 30, the first 25 samples in each group are processed and filtered correctly, while the last 5 samples in the group are filtered to 0 reads.
Samples 31-55 are correct, 56-60 are filtered to 0. 61-85 are OK, 86-90 are zeros. And so on...

In batch:

sample-id	input	filtered	percentage of input passed filter	denoised	merged	percentage of input merged	non-chimeric	percentage of input non-chimeric
#q2:types	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric
Ban.R1	34415	25093	72.91	24950	24507	71.21	24489	71.16
Ban.R2	46916	35372	75.39	35150	34673	73.9	34299	73.11
Ban.R3	68330	52604	76.99	52428	51988	76.08	50758	74.28
Ban.R4	26027	17160	65.93	16944	16279	62.55	15848	60.89
Ban.R5	30320	23592	77.81	23427	23152	76.36	23065	76.07
Ban.R6	42462	30830	72.61	30579	30060	70.79	29651	69.83
Ban.R7	45445	29816	65.61	29534	28403	62.5	27976	61.56
Ban.R8	77493	47258	60.98	47033	37531	48.43	34393	44.38
Ban.R9	63696	47433	74.47	47258	45982	72.19	45629	71.64
Ban.R10	18971	12111	63.84	11876	9797	51.64	9787	51.59
Ban.R11	127810	80674	63.12	80496	79147	61.93	74780	58.51
Ban.R12	20333	13394	65.87	13125	12524	61.59	12495	61.45
Ban.R13	23739	15662	65.98	15388	14093	59.37	13733	57.85
Ban.R14	66661	44563	66.85	44378	41694	62.55	39403	59.11
Ban.R15	57505	41538	72.23	41330	38517	66.98	33662	58.54
Ban.R16	36302	27728	76.38	27532	26985	74.33	26703	73.56
Ban.R17	79508	60794	76.46	60512	59712	75.1	53463	67.24
Ban.R18	15382	10091	65.6	9867	9417	61.22	9412	61.19
Ban.R19	16431	12351	75.17	12115	11763	71.59	11743	71.47
Ban.R20	25313	17781	70.24	17604	17197	67.94	17193	67.92
Ban.R21	72790	46279	63.58	46056	45713	62.8	45192	62.09
Ban.R22	20664	14136	68.41	13842	12806	61.97	12805	61.97
Ban.R23	43694	30971	70.88	30845	30674	70.2	30425	69.63
Ban.R24	3565	2315	64.94	2208	1948	54.64	1939	54.39
Ban.R25	70223	47477	67.61	47193	44187	62.92	42290	60.22
Ban.R26	48225	0	0	0	0	0	0	0
Ban.R27	50671	0	0	0	0	0	0	0
Ban.R28	50430	0	0	0	0	0	0	0
Ban.R29	20805	0	0	0	0	0	0	0
Ban.R30	32571	0	0	0	0	0	0	0
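
The affected samples can also be picked out of the stats table programmatically. Below is a minimal Python sketch, assuming the table has been exported as tab-separated text laid out like the one above. The function name zero_filtered_samples and the trimmed example rows are illustrative only and are not part of QIIME 2.

```python
import csv
import io


def zero_filtered_samples(tsv_text):
    """Return IDs of samples that had input reads but none left after filtering."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        # Skip the '#q2:types' directive row found in QIIME 2 metadata files.
        if row["sample-id"].startswith("#"):
            continue
        if float(row["input"]) > 0 and float(row["filtered"]) == 0:
            flagged.append(row["sample-id"])
    return flagged


# Three rows copied from the batch table, trimmed to the first three columns.
example = (
    "sample-id\tinput\tfiltered\n"
    "#q2:types\tnumeric\tnumeric\n"
    "Ban.R25\t70223\t47477\n"
    "Ban.R26\t48225\t0\n"
    "Ban.R27\t50671\t0\n"
)
print(zero_filtered_samples(example))
```

With the three example rows, only Ban.R26 and Ban.R27 are reported.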

So I tested whether the data itself is correct by importing it as a separate analysis. The same sample that was filtered to 0 reads in the batch is filtered as expected, to about 70% of input reads, when processed alone.

Separate analysis:

sample-id	input	filtered	percentage of input passed filter	denoised	merged	percentage of input merged	non-chimeric	percentage of input non-chimeric
#q2:types	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric
Ban.R27	50671	36496	72.03	36290	35948	70.94	32277	63.7

We are using qiime2-2021.2 installed with conda (4.10.1). We checked whether the problem persists in qiime2-2021.4, and it is still present. The system is Ubuntu 20.04.2 LTS.

Could you please advise? A quick solution is to process the samples separately, but that is not sustainable.

qiime dada2 denoise-paired \
  --verbose \
  --i-demultiplexed-seqs tmp/qiime2/reads.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 270 \
  --p-n-threads 0 \
  --p-no-hashed-feature-ids \
  --o-table tmp/qiime2/table-dada2.qza \
  --o-representative-sequences tmp/qiime2/rep-seqs-dada2.qza \
  --o-denoising-stats tmp/qiime2/dada2-stats.qza

From the verbose output it seems there is a problem with
DADA2: 1.18.0 / Rcpp: 1.0.6 / RcppParallel: 5.0.2
not writing those reads.

verbose output:
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmpahga032z/forward /tmp/tmpahga032z/reverse /tmp/tmpahga032z/output.tsv.biom /tmp/tmpahga032z/track.tsv /tmp/tmpahga032z/filt_f /tmp/tmpahga032z/filt_r 0 270 0 0 2.0 2.0 2 independent consensus 1.0 0 1000000

R version 4.0.2 (2020-06-22) 
Loading required package: Rcpp
DADA2: 1.18.0 / Rcpp: 1.0.6 / RcppParallel: 5.0.2 
1) Filtering The filter removed all reads: /tmp/tmpahga032z/filt_f/C69_178_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C69_358_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R119_118_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R119_298_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C67_176_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C67_356_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C66_175_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C66_355_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R117_116_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R117_296_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R116_115_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R116_295_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C68_177_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C68_357_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/C70_179_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/C70_359_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R118_117_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R118_297_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R27_26_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R27_206_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R120_119_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R120_299_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R28_27_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R28_207_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R149_148_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R149_328_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R150_149_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R150_329_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R146_145_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R146_325_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R26_25_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R26_205_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R148_147_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R148_327_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R147_146_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R147_326_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R29_28_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R29_208_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R30_29_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R30_209_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R57_56_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R57_236_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R56_55_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R56_235_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R59_58_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R59_238_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R90_89_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R90_269_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R58_57_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R58_237_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R60_59_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R60_239_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R88_87_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R88_267_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R86_85_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R86_265_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R87_86_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R87_266_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpahga032z/filt_f/Ban.R89_88_L001_R1_001.fastq.gz and /tmp/tmpahga032z/filt_r/Ban.R89_268_L001_R2_001.fastq.gz not written.
Some input samples had no reads pass the filter.
.........................xxxxx...................xxxx.x...........................xxxx.x...........xxxx.x...........................xxxx.x...........................xxxx.x.........

thermokarst · May 24, 2021, 9:51pm

Hi there @valzip!

The short answer is that the training model is based on the first "n" reads in your dataset - the default is 1,000,000. If you change the ordering of the samples, your model will be trained with slightly different data, which is probably what you're observing here. A related post:

I want to make sure you're aware that DADA2 expects to be executed on a per-sequencing-run basis. It's okay to subdivide a run into multiple execution batches, but combining sequencing runs should be avoided!

I hope that helps.


valzip · May 25, 2021, 1:25pm

Hi Matthew @thermokarst, thanks for the answer!

I would agree with this if it were just a subtle change in numbers, e.g. if 14136 of 20664 reads passed the filter instead of 15001. Here we are missing all reads in several samples.

Today I tested our workflow on a different machine, and the "error" persisted there as well. However, I made a mistake and copied a step with a lower --p-trunc-len-r value, and all samples were processed correctly.

I found out that some reads were shorter than the declared "--p-trunc-len-r 270", and the filtering step of the R analysis in qiime discarded them.
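
DADA2 discards any read shorter than the truncation length, so a sample whose reverse reads are all shorter than that value loses everything at the filtering step. A rough way to check this before denoising is sketched below in Python, assuming the reads are stored as gzipped four-line FASTQ files. The function name count_short_reads and the demo file are hypothetical and only serve the illustration.

```python
import gzip
import os
import tempfile


def count_short_reads(fastq_gz_path, trunc_len):
    """Return (short, total): reads shorter than trunc_len, and all reads."""
    short = total = 0
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In four-line FASTQ, the sequence is the second line of each record.
            if line_number % 4 == 1:
                total += 1
                if len(line.strip()) < trunc_len:
                    short += 1
    return short, total


# Tiny made-up file: one 270 nt read and one 269 nt read.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_R2.fastq.gz")
with gzip.open(demo_path, "wt") as out:
    for name, length in (("read1", 270), ("read2", 269)):
        out.write(f"@{name}\n{'A' * length}\n+\n{'I' * length}\n")

print(count_short_reads(demo_path, 270))  # (1, 2)
print(count_short_reads(demo_path, 269))  # (0, 2)
```

A sample where short equals total at the chosen --p-trunc-len-r would be expected to end up with 0 filtered reads.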

I fiddled with this setting and found out that we were just one nucleotide off.
Filtering progress for each value of --p-trunc-len-r:

259
....................................................................................................................................................................................
264
....................................................................................................................................................................................
269
....................................................................................................................................................................................
270
.........................xxxxx...................xxxx.x...........................xxxx.x...........xxxx.x...........................xxxx.x...........................xxxx.x.........
271
.........................xxxxx...................xxxx.x...........................xxxx.x...........xxxx.x...........................xxxx.x...........................xxxx.x.........
272
..........xxxxx.....xxxxxxxxxx...xxxxx....x.xxxxxxxxxxx.........x.xxxxx....x.xxxxxxxxxxx......xxxxxxxxx.x...........xxxxx......xxxxxxxxx.x...........xxxxx......xxxxxxxxx.x.........
273
.....xxxxxxxxxx.....xxxxxxxxxx.xxxxxxx....x.xxxxxxxxxxx.....xxxxxxxxxxx....x.xxxxxxxxxxx......xxxxxxxxx.x.....xxxx.xxxxxx......xxxxxxxxxxx.....xxxxxxxxxxx....x.xxxxxxxxxxx.....xxxx
289
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
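
To compare such runs at a glance, the progress strings can be tallied. The Python sketch below assumes each character stands for one sample and that x marks a sample with no reads left after filtering, which is consistent with the "not written" messages in the verbose output. The function name failed_fraction and the short strings are made up for illustration.

```python
def failed_fraction(progress):
    """Return (failed, total) for a progress string of '.' and 'x' characters."""
    marks = [c for c in progress if c in ".x"]
    return marks.count("x"), len(marks)


# Made-up progress strings keyed by a hypothetical truncation length.
runs = {
    269: "..........",
    270: ".....xx...",
    289: "xxxxxxxxxx",
}
for trunc_len, progress in sorted(runs.items()):
    failed, total = failed_fraction(progress)
    print(f"trunc-len-r {trunc_len}: {failed}/{total} samples lost all reads")
```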

I am a bit surprised that this action did not trigger any error message.

That being said, there is no error on the side of qiime2. There is some preferential trimming upstream, by either je or fastp. Thank you for your time.

As for running DADA2 per sequencing run: yes, we are aware of this, but sometimes we have huge experiments (150 samples in a single sequencing run, as we use dual barcoding). Thanks.
