Dada2: good quality score plots, but few sequences getting through!

We ran the following command for demultilpexing our reads:

qiime cutadapt demux-paired
--i-seqs multiplexed-seqs.qza
--m-forward-barcodes-file metadata.tsv
--m-forward-barcodes-column BarcodeSequence
--p-error-rate 0
--o-per-sample-sequences demux.qza
--o-untrimmed-sequences untrimmed.qza
--verbose

The quality score plots are shown below---we were quite happy with these. We had more than 800,000 total sequences for our 18 soil samples.

We then used the following command for the dada2 denoising:

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux.qza
--p-trim-left-f 0
--p-trim-left-r 6
--p-trunc-len-f 295
--p-trunc-len-r 288
--o-table table2.qza
--o-representative-sequences rep-seqs2.qza
--o-denoising-stats denoising-stats2.qza

The problem started when we looked at our table and denoising-stats. We went down to a total frequency of features that was 5,513. Additionally, while all of our samples had well over 30,000 sequence counts to begin with, after dada2 we had one sample with 0 sequence counts left! Is that reasonable, given how nice our QS plots looked? Should we cut/trim more? Less? We are surprised by these results from soil samples......


1 Like

Thanks @Kara for the extensive documentation!

You can see from the stats file that the vast majority of sequences are being dropped at the filtering/denoising step. This is indeed due to insufficient truncation; too many low-quality bases are being left at the tips of these reads, causing many to be dropped.

The forward reads look great but the reverse reads look less good — I would truncate those to no more than 260 (you could try eking out a bit more if needed for merging, but I’d lean toward truncating more here if this is 16S data)

However you are also losing some reads at the merging stage. So before telling you to truncate more I just want to make sure you have enough read left over for sufficient merging (~20nt min overlap between reads).

  1. what is your target amplicon?
  2. what is the expected amplicon length?

I hope that helps!

@Nicholas_Bokulich, we weren’t exactly sure where the problem began, so we thought we’d include everything but the kitchen sink! Thank you for looking at our data! We did amplicon sequencing of the V1-V3 region of the 16S rRNA gene. That region should be about 490 bp, so with our 2x300 MiSeq reads (600 total), we should have 110 bp overlap (is that correct?). We should be able to truncate the reverse reads to 260, no problem.

Can you please clarify one more thing? It’s a bit counter intuitive that we are losing sequences in filtering/denoising due to insufficient truncation. You would think that if you truncate more, you would lose more sequences in the end. Is it because low-quality bases CREATE more noise, and by truncating more bp from the ends of the read we actually improve the sequences, decrease the noise, and in doing so, keep more sequences in the end?

Many thanks! This stuff makes us feel pretty smart…sometimes… :slight_smile:

2 Likes

Correct :100:

It is a balancing act. Low-quality bases beyond a certain threshold will cause the sequence to be dropped altogether, so we want to truncate off as much of the noisy 3' end as we can. But truncating too much risks causing read merging to fail.

So yes, it is counterintuitive but trimming/truncation (to an extent) actually improves yields by saving sequences from filtering.

Fortunately, since these are 16S there is not too much length heterogeneity. So you want 490 bp + 20 bp minimum overlap = 510 bp minimum combined length. 290 + 260 indeed should merge no problem. You could even attempt to trim more — say, 270 forward, 250 reverse — and see how that compares (you might lose slightly fewer sequences to denoising so it may increase yields)

Good luck!

2 Likes

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.