DADA2 vs Cutadapt

Hello everyone!

I am a beginner in Qiime2 and currently doing some exploratory work using Qiime2 on my soil 16S dataset. I had a hard making a decision of whether to trim off the primers using Cutadapt before DADA2 or using the trim function in DADA2 straight. I am aware that this issue was brought up several times before, and the general consensus seems to favor the latter option of only using DADA2. However, my supervisor prefers trimming the primers with Cutadapt before preceding to denoising. Therefore, I ran a few trials to test if the results differ a lot using various approaches. And it seems they do vary quite a lot. I’d like to ask the experts whether the results make sense and which option should I stick to.

Some background:

  • Sample type: soil
  • Data type: Illumina MiSeq PE250x2
  • Sample size: 147 (including 2 kit controls)
  • Primer sets: (515F-806R)
    • Forward: GTGCCAGCMGCCGCGGTAA
    • Reverse: GGACTACHVGGGTWTCTAAT

I first examined the interactive plot of the demultiplexed data.

image
Figure1. Interactive plot (paired-end-demux.qzv)

Then I ran the following code to trim off the primers with Cutadapat.

qiime cutadapt trim-paired
–i-demultiplexed-sequences paired-end-demux.qza
–p-cores 4
–p-front-f GTGCCAGCMGCCGCGGTAA
–p-front-r GGACTACHVGGGTWTCTAAT
–p-discard-untrimmed
–p-no-indels
–o-trimmed-sequences reads_trimmed.qza

Here are the plots obtained after trimming.

image
Figure2. Interactive plot (reads_trimmed.qzv)

According to the plots above, I ran 4 trials using the listed parameters below.

Table1. Comparison of trim and truncation parameters used in four approaches

The following are the results of the “table summary” section from table.qzv files produced by each approach.

Table2. Comparison of the “table summary” section.

Moving forward, I examined the denoising-stats.qzv files and calculated the averages and SD of the output parameters (n=147).

Table3. denoising-stats.qzv file summary
image

My question is why would these methods yield results with such huge differences, which parameters should I value the most to determine the optimal method for my dataset? Sorry for the lengthy question…

Thank you!

1 Like

Hi @Rui,
Welcome to the forum! And thanks for providing such details of your issue.

I’m not remembering any discussions that favored one method over the other. Can you link to those threads so I can have a read, mainly because as far as I’m concerned there really isn’t a preference over one or the other. The DADA2 trimming function is a convenience thing but is limited in functionality, so not all datasets can be handled with it. On the other hand cutadapt provides quite a bit of flexibility and additional features of removing primers/adaptors so let’s see what works best for you. In general though, they should result in the same output as long as reads actually start with the primers and not something else like spacers or stacked primers etc.

First, when you include the --p-discard-untrimmed parameter you are discarding reads that don’t have the primer in them. When you truncate with DADA2, you retain everything regardless of what your reads start with. I prefer discarding the reads that don’t have the primers in them, because to me those are more likely to be junk reads anyways. But that is one difference between the two approaches you’ve done.

Trial 1 and Trial 4: The only difference here is that 1 nt you remove from the 5 in Trial 1. I’m not sure what the purpose here was but those 2 runs look very similar to me and the difference is negligible. I’d personally not trim that 1 nt.

Trial 1 and 3. The reason why you are seeing more reads from run 1 than 3 is because you are truncating more from the 3’ of your reverse reads, which are generally poor in quality. By getting rid of those poor quality tails, you retain more during the filtering step (Table 3).

Trial 2 and 4: There’s a couple of differences here. Your starting input is somewhat higher when you don’t use cutadapt (Trial 2) meaning that there were a bunch of reads that didn’t have primers in them but are going through DADA2. As I mentioned, these can be bad reads, or some sort of contaminants so may be artificially inflating your # of unique features. Another possibility is that there may be something else before your primers that is causing truncating by a fixed number (DADA2) to produce extra unique features, this isn’t an issue when you trim with cutadapt, because this gets rids of everything before the primers along with it.

As you can see you aren’t really comparing apples to apples here so its hard to say what is the exact source of differences, but my personal recommendations is to go with Trial 4. You could probably even increase your yield a bit more by truncating a bit more from your 3’ in that run.

Here’s a quick calculation:
806-515= 291 amplicon size
2x250 run - 291 amplicon = 209 overlap region
209-12 nt (min overlap for DADA2) + 20 extra nt (natural variation to be safe) = 177 max truncating

Based on this you can technically truncate up to 177 nts combined from your forward and reverse reads and still have have enough to merge your reads. Since these would be getting rid of poor quality tails, this should (from experience) lead to more reads being retained during the filtering step and ultimately more reads at the end.

Hope this helps.

3 Likes

Thank you @Mehrbod_Estaki for the thorough and clear explanation, I appreciate it!

I just need a few clarifications, so If I understood correctly,

    1. When I examine the interactive plot, both forward and reverse plots have the 5’ ends on the left, which means that the primers are all located on the left side right? And if I were to align them, I’d have to flip one of the plots.
    1. The maximum truncating length (177 nts) is calculated assuming the primers have not been trimmed by cutadapt, because 19nts and 20nts will be trimmed off from the forward and reverse reads respectively during Cutadapt. So in theory, the maximum truncating length after Cutadapt should be calculated this way:
    • Forward reads after trimming: 250-19 = 231 nts
    • Reverse reads after trimming: 250-20 = 230 nts
    • 231+230 - 291(amplicon size) = 170 overlap
    • 170 - ( 12 nt + 20 nts ) = 138 nts
    • Is this correct ?
    1. If I were to calculate how many nts have been truncated out of the sequence in my trial4, I can do
    • 231+230 = 461 nts (total length of the forward and reverse reads)
    • 228 + 195 423 nts (forward and reverse reads I retained )
    • 461 - 423 = 38 nts (reads truncated)
    • And in theory, there is still room for 138-38 = 100 nts for truncation.

Again, thank you for your time and patience.

Hi @Rui,

  1. Correct. The overlap occurs on the 3’.
  2. Not exactly, the calculation is based on the overlap region so it doesn’t matter if the primers are trimmed or not.
  3. As above, your max truncating calculation is based on the overlap region, which in your case we know is ~209.
1 Like