Using DADA2, losing 50% of reads after filtering and 80% of reads by the end

My setup:
I have 16S V4 amplicon sequences generated with the 515F–806R primers. The samples were extracted from soil. There are about 6 million reads across 74 samples, paired-end 2x300 bp. I'm running QIIME2 2019.10 in a conda environment.

Problem:
When I run DADA2 I lose a lot of reads: around 80% by the end of the chimera-removal step.
The biggest drop is during filtering, where I lose 50%.

What I’ve done already:
I’ve read through the DADA2 tutorial and three QIIME2 tutorials (Atacama, Moving Pictures, and FMT) and several forum posts.

  1. Trimmed reads need to overlap enough to merge. As I understand it, the V4 region is generally less than 400 bp, so my trimmed lengths below are pushing the limits and I probably shouldn’t shorten them any further. I’ve tried not trimming at all, but I end up with fewer reads (which makes sense, since more erroneous bases are retained).
  2. Shortening the reads should remove more errors and increase the number of reads that pass. This seems true: when I shortened the reverse reads from 160 to 140 I squeezed out a couple more percentage points of reads.
  3. Increasing --p-max-ee should increase the number of reads that pass filtering. I’m hesitant to relax this parameter, but I could still try it.
  4. In some circumstances, tossing the reverse reads (if they’re low quality) and keeping only the forward reads can increase coverage, because fewer reads are lost at the merge step. I haven’t tried this yet.
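To put a rough number on point 1, the expected merge overlap can be estimated from the trunc/trim settings. This is just a back-of-the-envelope sketch: the ~253 bp post-primer V4 length and DADA2's 12 nt minimum overlap (the `mergePairs` default) are assumptions worth verifying for your own data.

```shell
# Rough overlap estimate for paired-end merging (assumptions: V4 amplicon
# is ~253 bp after primer removal; DADA2 mergePairs needs >= 12 nt overlap).
amplicon=253
trunc_f=290; trim_f=0   # forward: --p-trunc-len-f / --p-trim-left-f
trunc_r=140; trim_r=6   # reverse: --p-trunc-len-r / --p-trim-left-r
overlap=$(( (trunc_f - trim_f) + (trunc_r - trim_r) - amplicon ))
echo "approximate overlap: ${overlap} nt"
```

With the parameters below that comes out to roughly 170 nt of overlap, which suggests merging itself has plenty of headroom.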

Command:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 6 \
  --p-trunc-len-f 290 \
  --p-trunc-len-r 140 \
  --p-max-ee-f 2 \
  --p-max-ee-r 2 \
  --p-n-threads 6 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

I have already read through some forum posts:

With this one, I’m using the default --p-trunc-q value, so I don’t think it applies. They do, however, suggest using just the forward reads.

This one says “9000 sequences is plenty,” so maybe my 10,000–20,000 reads per sample is “fine.” However, I don’t feel good about that metric: in these highly diverse communities I would like to represent the diversity accurately without wantonly tossing reads.
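For what it's worth, those per-sample numbers are consistent with the totals stated above. A quick sanity check, assuming ~6 M reads across 74 samples and ~80% overall loss (both figures taken from this post):

```shell
# Back-of-the-envelope reads-per-sample check (integer arithmetic).
total=6000000
samples=74
per_sample=$(( total / samples ))        # raw reads per sample
surviving=$(( per_sample * 20 / 100 ))   # ~20% survive if ~80% are lost
echo "~${per_sample} raw reads/sample, ~${surviving} after denoising"
```

That gives roughly 81,000 raw reads per sample and ~16,000 survivors, which matches the 10,000–20,000 range quoted above.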

Questions:

  1. Is losing 80% of reads high or is that typical? What about 50% during filtering?
  2. Are my trimming parameters leaving enough overlap and could I trim them more to reduce errors?
  3. Should I try adjusting max-ee or is that frowned upon? Will that hurt my merge step?
  4. Would my data be worth running just the forward reads and not trying to merge forward and reverse?
  5. Anything else I haven’t considered?

My data:
demux.qzv (297.3 KB)
denoising-stats.qzv (1.2 MB)

Thank you!

Hi Andrew,
I would love to know the answer to that too; we are having similar issues with faecal/caecal samples, following the EMP protocol to the letter, with V4-V5 primers (example below). Which protocol are you using for your library prep?

[image attachment]

Sorry I don’t have an answer (yet), we are currently running some tests.

Best wishes,

Francine

@amorris28 and @FrancineMarques,
Thanks for the detailed post and providing the demux summary QZV — just what I needed to diagnose.

The issue does appear to be your trimming parameters. The forward reads are of pretty good quality, but 290 nt is a bit too long a truncation point, given the quality drop-off around bases 280-290.

The reverse reads are pretty much rubbish. You could truncate them aggressively and pray the pairs still join, but given the length of V4 I wonder if it may be better to just use the forward reads (they cover most of the V4 region, and its most informative portion anyway, presuming the reads are in the correct orientation). But I recommend trying it both ways: trim the forward reads a little more, run both single-end and paired-end, and compare the yields.

That should improve read yield at the filtering stage, since you will save more forward reads that have some low-quality bases between bases 280 and 290.

Don’t do that! That will lead to more filtering. You are losing most of your reads at the filtering stage, not the overlap stage.

Don’t! It can help but I think what I have recommended above is a better way.

Try it! Your reverse reads do not look good, but the forward looks great. Don’t sully the forward by pairing it with grubby reverse reads.

This is more than usual.

YES!

Now to you @FrancineMarques
(Thanks by the way for posting on this topic, @FrancineMarques, instead of opening a new one, since you both have what sounds like quite similar problems)

@FrancineMarques, to diagnose your issue I would really need to see the demux summary QZV to view the quality profiles. But your problem looks pretty similar to @amorris28’s: you are probably not trimming/truncating adequately, so low-quality bases are being left in the reads and too many reads are lost at the filtering step. Try trimming and truncating more to retain more reads.

HOWEVER, your issue is slightly different, because your amplicon (V4-V5) is longer, so you need to leave enough read length to permit overlap. Try truncating more, but pay very close attention to the read yield at the “merged” step: you want to adjust the parameters so that the “filtered” yield increases without the % merged going down.
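To make that overlap constraint concrete, here is a sketch of how much merge overlap remains at various reverse truncation lengths. The ~370 bp post-primer V4-V5 amplicon length used below is an assumption; check the actual length for your primer pair before relying on it.

```shell
# Remaining merge overlap for a longer V4-V5 amplicon at several reverse
# truncation lengths (assumed amplicon length: ~370 bp after primer removal).
amplicon=370
trunc_f=280
for trunc_r in 180 200 220; do
  overlap=$(( trunc_f + trunc_r - amplicon ))
  echo "trunc-len-f=${trunc_f}, trunc-len-r=${trunc_r}: ~${overlap} nt overlap"
done
```

The point of the exercise: with a longer amplicon, every base you truncate comes straight out of the overlap budget, so there is much less room to cut away low-quality tails than with V4 alone.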

Good luck both.

Thanks for the quick response @Nicholas_Bokulich!

Truncating the forward reads at 280 and tossing the reverse reads worked best. With just the forward reads I lost only 20% during filtering and about 40% overall. Here’s the command I ended up using:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len 280 \
  --p-n-threads 4 \
  --o-table single-table.qza \
  --o-representative-sequences single-rep-seqs.qza \
  --o-denoising-stats single-denoising-stats.qza

And here’s the resulting denoising stats:
single-denoising-stats.qzv (1.2 MB)

I also tried keeping both the forward and reverse reads while truncating the forward reads at 280, but it didn’t change much; I still lost 70-80% of the reads.

Finally, I tried relaxing the max EE on the reverse reads (--p-max-ee-r 5), which raised the fraction passing filtering to about 80%, but I then lost a lot more at the merge step and ended up with only 25-30% by the end.

Thanks for the help! I’ll continue with just the forward reads.
