DADA2: The filter removed all reads for some samples

Hello, everyone.

I'm trying to use DADA2 to work with paired-end sequences of 16S V3-V4 region (338F/806R) from soil samples (Illumina MiSeq). I have removed primers and barcodes from the sequences from q2-cutadapt. And here are the quality plot of my sequences.


B_trimedSeq.qzv (319.8 KB)

After that, I run dada2 to denoising the sequences. For the extremely poor quality of reverse reads, I tried to trim more in the reverse length, but I need to ensure enough overlap region for merging. So I trimmed the length of 272 for forward reads, and 231, 217 for reverse reads.

Here are my commands:

time qiime dada2 denoise-paired \
  --i-demultiplexed-seqs ./A_trimedSeq.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 277 \
  --p-trunc-len-r 203 \
  --p-n-threads 20 \
  --o-table ./DADA2/dada2_Tab.qza \
  --o-representative-sequences ./DADA2/dada2_Reps.qza \
  --o-denoising-stats ./DADA2/dada2_Stats.qza
  --verbose

Here are some outputs:

1) Filtering The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C1_0_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C1_126_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C110_109_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C110_235_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C122_121_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C122_247_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C24_23_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C24_149_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C30_29_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C30_155_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C53_52_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C53_178_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C7_6_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C7_132_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C47_46_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C47_172_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C41_40_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C41_166_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C70_69_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C70_195_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C64_63_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C64_189_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C76_75_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C76_201_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpl9lu_p1_/filt_f/C99_98_L001_R1_001.fastq.gz and /tmp/tmpl9lu_p1_/filt_r/C99_224_L001_R2_001.fastq.gz not written.
Some input samples had no reads pass the filter.

By the way, I use QIIME2 2021.4 installed by conda in a linux platform.

I've read similar posts in our formula, most of which suggested to trimmed strictly for sequences length. However, it seemed not to work for me. So I upload my questions and results here and ask for help. I'm so depressed now, and Can anybody give me some suggestions? Thank you very much in advance.

Sirong

Hello Sirong,

I think you are doing a good job, and are on the right track. This is good:

Let's calculate the area of overlap you would expect based on these trimming parameters. (Here's how this calculation works, if that's helpful.)

length of amplicon = end (reverse primer) - start (forward primer)
length of amplicon = 806 - 338
length of amplicon = 468

(forward read) + (reverse read) - (length of amplicon) = overlap
277 forward + 203 reverse - 468 amplicon = overlap
-38 = overlap

So, a negative overlap means there is no overlap at all! Instead, you have a gap between reads of 38 basepairs.

This explains why no reads passed the filter.

To get your reads to join, you will have increase your truncation length so the reads overlap a little bit. Try running dada2 again, with new truncation settings, like these:

  --p-trunc-len-f 277 \
  --p-trunc-len-r 255 \

You are correct that it's usually best to trim off low quality regions before joining. But your V3-V4 region is very long, so your reads must also be long if you want to join them.

If the area of overlap is too low quality to join, you can still continue your analysis using just your forward reads! Because the quality of your forward read1 is very good, consider using dada2 denoise-single, which does not require a second read or any joining step.

Let us know what you try next. We are always here to help.

1 Like

Hello Colin,

Thank you very much for your suggestions! I tried several times, and show here the corresponding results. May you give me further suggestions because there are still some confusing points.

  1. I tried to lengthen the trimmed length of reversed sequences that have removed primers. I tried the length of 255, 231, 212.

    time qiime dada2 denoise-paired
    --i-demultiplexed-seqs ./B_trimedSeq.qza
    --p-trim-left-f 0
    --p-trim-left-r 0
    --p-trunc-len-f 271
    --p-trunc-len-r 212
    --p-n-threads 20
    --o-table ./DADA2/dada2_Tab2.qza
    --o-representative-sequences ./DADA2/dada2_Reps2.qza
    --o-denoising-stats ./DADA2/dada2_Stats2.qza
    --verbose

Unfortunately, for all candidate length values, there were always several samples that filter all sequences at the first step of dada2. The length values of 255 and 231 filtered 39/126 samples while 212 filtered 17/126 samples. Owing to the requirement of overlapping regions, I didn't further decrease the length value.

  • screenshot of dada2 summary for length values of 255 and 231

  • screenshot of dada2 summary for length value of 212
  1. Because for the sequences with barcode and primers (about 20bp), the quality of the start regions (non-biological part) is better than the end region. So I was thinking, can I preserve primers and trim more length for quality control. So I tried to do dada2 for the original sequences without barcode and primer removal.
  • the quality plot of sequences without barcode and primers removal
    B_rawdata.qzv (314.9 KB)

According to the quality plots, I trimmed 5bp for the left and more for the end nucleotides. For this case, all samples were preserved. However, the percentage of input merged and the percentage of input non-chimeric is so low, about 20% - 40%. I'm not sure the values are suitable. I tried to trim less length of the sequence to ensure enough merging overlap, but the results were even worse. So I'm confused, and can you give me some advice?

time qiime dada2 denoise-paired \
  --i-demultiplexed-seqs ./rawdata/B_rawdata.qza \
  --p-trim-left-f 5 \
  --p-trim-left-r 5 \
  --p-trunc-len-f 296 \
  --p-trunc-len-r 212 \
  --p-n-threads 20 \
  --o-table ./DADA2/dada2_Tab3.qza \
  --o-representative-sequences ./DADA2/dada2_Reps3.qza \
  --o-denoising-stats ./DADA2/dada2_Stats3.qza \
  --verbose

  1. According to your advice, I also tried to do dada2 with only forward reads. For this, the results seem better than those from paired sequences.

    time qiime dada2 denoise-single
    --i-demultiplexed-seqs ./B_trimedSeq_sig.qza
    --p-trim-left 0
    --p-trunc-len 270
    --p-n-threads 20
    --o-table ./DADA2/dada2_Tab_sig.qza
    --o-representative-sequences ./DADA2/dada2_Reps_sig.qza
    --o-denoising-stats ./DADA2/dada2_Stats_sig.qza
    --verbose

But I'm thinking, what sequences I aim to amplify is V3-V4 with 468bp. If I only use the forward reads (only about 270bp after trimming primer and barcode) to do the following analyses, would it be credible to do the taxonomy assignment and other analyses? Would you give me some suggestions?

Thank you again in advance.

Sirong

Hello Sirong,

Thanks for trying those different length values. While they did not work well, they did confirm that we need very long reads to join the full length amplicon. (This may be a reason to use V4 amplicon, insead of V3-V4 in the future, as the 250 bp V4 amplicon is much easier to cover with paired-end reads.)

I found this section very interesting:

Because the barcode and primer is near the start of your forward read, you can chose not to trim it before running dada2. However, this does not change how much your reads will overlap, so we still have problems joining the reads.

|primer------------------>                        R1
                       <------------------------| R2
      |------------------>                        R1 no primer
^no primer             <------------------------| R2
                       ^^^ same issue with joining

I should comment on this as well:

The q2-dada2 plugin will only join if all basepairs in the area of overlap are an exact match. But with the quality at the end of R2, there are too many differences to join these reads. :crying_cat_face:

However, exact matches between joined reads are not always needed! If you run DADA2 in R or use qiime vsearch join-pairs, then you can allow some mismatches between the two reads, which is especially important when joining long reads with this quality.

Have you worked with R before? Is so, try running dada2 directly!


Those results look great! That's what we wanted to see with paired-end reads!

Yes. Your forward reads are basically just the V3 region, which is fine. It will be shorter than V3-V4, and that will have less taxonomic resolution, but it will also be higher quality and avoid any bias due to pairing.

Let me know what you try next. You are making very good progress!

Colin :whale2:

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.