denoising on multiple runs to be combined downstream

I’m following the FMT tutorial to process samples from multiple sequencing runs. I don’t fully understand the statement: “When we run denoise-single, we need to use the same values for --p-trunc-len and --p-trim-left for both runs.” I’m using cutadapt to remove primers prior to denoising, so I’m not using the trim options. For truncating poor-quality base calls at the ends of reads, I’m not sure I understand why trunc-len has to be the same for each run as long as there is sufficient overlap for joining. (I’m also running trunc-q with a threshold of 18 and comparing the results.) Using the threshold cut-off would generate reads with different lengths (just as I assume using different trunc-len values for each run would). Is using the same values for each run specific to single-end reads and not applicable to paired-end reads using denoise-paired?

Running QIIME 2 2020.2.0, installed using conda on a Linux server.

thank you

Hi @hsapers.
The ultimate goal here is that the reads you are comparing should cover exactly the same region; otherwise even a single-nucleotide difference between them would lead to their being called different features.
For example:
In study 1: V4 region, trim length = 0 (an example read):

TTTCCCGGGAAA

In study 2: same V4 region, same read, trim length = 1:

TTCCCGGGAAA

So even though these two reads are technically identical and should be called the same feature, with a single nucleotide trimmed they will be called differently.
The same applies for truncating from the 3’ instead of trimming from 5’.

In study 1: V4 region, trunc length = 0:

TTTCCCGGGAAA

In study 2: same V4 region, same read, trunc length = 1:

TTTCCCGGGAA

So this explains the first statement: when doing denoise-single, trim-left and trunc-len need to be the same across runs.
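As a toy illustration of the above (a Python sketch, not DADA2 itself; the read is made up):

```python
# Toy sketch of how trim-left / trunc-len reshape reads. Following DADA2's
# filterAndTrim convention, the filtered read length is trunc_len - trim_left;
# trunc_len=0 means "no truncation".

def trim_trunc(read, trim_left=0, trunc_len=0):
    if trunc_len:
        read = read[:trunc_len]   # cut the 3' end at position trunc_len
    return read[trim_left:]       # then drop trim_left bases from the 5' end

read = "TTTCCCGGGAAA"             # the same biological read in both runs

run1 = trim_trunc(read, trim_left=0)   # "TTTCCCGGGAAA"
run2 = trim_trunc(read, trim_left=1)   # "TTCCCGGGAAA" -- 1 nt shorter

# The strings differ, so an exact-sequence denoiser would report two
# distinct features for what is biologically one read.
print(run1 == run2)               # False
```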

Now let’s consider a paired-end example:
2 different runs, same primers

Run 1: no trimming or truncating

F:      TTTGGGCCC
R:            CCCAAAGGGTTT
Merged: TTTGGGCCCAAAGGGTTT

Run 2: trim-left-f=0, trim-left-r=0, with 1 nt truncated from the 3’ end of each read (trunc-len set one base short of the read length)

F:      TTTGGGCC
R:             CCAAAGGGTTT
Merged: TTTGGGCCCAAAGGGTTT

You can see that the merged feature is identical in both cases, because truncating within the overlap region changed neither the length of our merged read nor its sequence.
However, let’s repeat Run 2, but now add trim-left-f=1:

F:       TTGGGCC
R:             CCAAAGGGTTT
Merged:  TTGGGCCCAAAGGGTTT

Now we see that our merged sequence is 1 nt shorter, and thus not comparable to the merged sequence from Run 1. So with paired-end reads, as long as there is sufficient overlap for merging, the truncation parameters can differ between runs, but NOT the trim-left values. This is why the tutorial is explicit about saying that for denoise-single both parameters need to be the same.
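The paired-end behaviour can be sketched with a toy merger (this is not DADA2’s algorithm, and the reads below are made up; it just reverse-complements the reverse read and joins at the longest exact overlap):

```python
# Toy paired-end merger: truncating 3' ends inside the overlap leaves the
# merged sequence unchanged, but trimming the 5' end shortens it.

def revcomp(seq):
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def merge(fwd, rev, min_overlap=4):
    """Join reads at the longest exact suffix/prefix overlap."""
    rc = revcomp(rev)
    for k in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        if fwd[-k:] == rc[:k]:
            return fwd + rc[k:]
    return None  # not enough overlap to merge

fwd = "TTTGGGACGTCG"    # forward read
rev = "GGGTTTCGACGT"    # reverse read (as sequenced)

run1 = merge(fwd, rev)                     # no trimming or truncating
run2 = merge(fwd[:-1], rev[:-1])           # truncate 1 nt off each 3' end
run2_trim = merge(fwd[1:-1], rev[:-1])     # additionally trim-left-f=1

print(run1)                    # TTTGGGACGTCGAAACCC
print(run2 == run1)            # True: truncation fell inside the overlap
print(len(run1) - len(run2_trim))  # 1: trimming shortened the merged read
```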

In your case, you said you are not using the trim options because you are using cutadapt; that technically means your trim-left values are indeed the same for both runs: zero. So you are meeting that criterion.

As for your approach with trunc-q=18, I would advise against it. With denoising methods, filtering on quality scores this aggressively is not really necessary anymore, which is why the default is set to 2; I would keep that default. Setting it to 18 would indeed cause a lot of length variation, and it is unnecessary here.

Hope this helps.


Thank you very much for the detailed explanations @Mehrbod_Estaki

I think that I’m not understanding exactly what happens at the input to denoising. I thought that using cutadapt was a better method than using the trim function to remove primers, because the latter assumes that the primers always start at the first base, and, allowing for some error, this may not always be the case, whereas cutadapt searches for and removes the primer by sequence. But if the reads have to be of the exact same region without even a single nucleotide difference, then my reasoning for using cutadapt is false and could create spurious features anyway.

In the case of paired ends, using cutadapt to remove primers will, if the primers do not all start at the first base, create some length discrepancy at the 5’ end. During denoising, since each read would start at its new first base (regardless of what was trimmed using cutadapt), this length difference would end up at the 3’ end, and as long as there was enough overlap after truncating low-quality tails, merging would proceed. However, could differences induced by cutadapt still lead to reads being called different features?

In the case of single-end reads, using cutadapt to remove primers will, if they do not all start at the first base, lead to reads of different lengths, similar to the scenario of using different trim values for different runs, and result in what should be the same feature being called as separate features.

Unless, of course, I specify in cutadapt to throw out all reads that deviate from the expected length (251).

Should I ignore the few reads that may have erroneous bases before the start of the primer, since these would end up as spurious features either way? Either they keep an extra base(s) before the start of the targeted amplicon region (if cutadapt ignores an erroneous first base and then removes the primer), or they keep an extra base(s) of primer (if trim removes the first k bases of a k-base primer but an erroneous base precedes the primer). I’m assuming these would be extremely rare features and would likely be removed as singletons regardless?

I was also wondering if you could elaborate why filtering based on q-score isn’t required anymore with denoising methods. Isn’t truncating reads based on where the confidence of the base calls drops just a blunt end way of quality filtering?

thank you

Hi @hsapers,
I think one important thing to note is that by default cutadapt will remove not only the primers but also everything preceding them. That means even if, for some odd reason, you have weird non-primer nt before your primers, they will be tossed. I also like to include the --p-discard-untrimmed parameter when I run cutadapt, so that any reads without the primers are discarded as well. So ultimately you flank a sequence with your primers, and the only thing you want to keep is the sequence within them. In paired-end cases, even if you have odd cases where your primers do not all start at the same position on the 5’ (and, as you said, this leads to length discrepancy on the 3’), it doesn’t matter, because a) we toss the preceding sequences, and b) the merging will correct for those length differences, meaning you won’t have the same target being called multiple features because of their length discrepancy.
Ex: Sequence 1 and 2 are biologically the same; however, for some reason Sequence 1 has an extra nt at the beginning (X):

Sequence 1 (overlap region = CCGG; X = a random, non-biological base somehow preceding our primer):

F: X-[primer]-TTTTCCGG 3'
R: X-[primer]-AAAACCGG 3'
cutadapt to remove [primer].
Result:    TTTTCCGGAAAA

Sequence 2: same as Sequence 1, but without X; lowercase a/t = the one extra base each read captures as a result of having no X:

F: [primer]-TTTTCCGGa 3'
R: [primer]-AAAACCGGt 3'
cutadapt to remove [primer].
Result:    TTTTCCGGAAAA

So Sequence 1 (TTTTCCGGAAAA) == Sequence 2 (TTTTCCGGAAAA), even though the primers started at different positions. That’s because the extra bases fall in the region covered by the other read, which is aligned based on sequence during merging.
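Conceptually, the cutadapt step above can be sketched like this (exact matching only; real cutadapt also tolerates mismatches, indels, and degenerate bases, and the primer sequence here is made up):

```python
# Sketch of cutadapt-style 5' primer trimming: remove the primer and
# everything preceding it; optionally drop reads in which the primer is
# not found, mimicking --p-discard-untrimmed.

PRIMER = "GTGCCAG"  # hypothetical primer, not a real one

def trim_primer(read, primer=PRIMER, discard_untrimmed=True):
    i = read.find(primer)
    if i == -1:
        return None if discard_untrimmed else read  # primer absent
    return read[i + len(primer):]  # drop primer AND everything before it

print(trim_primer("GTGCCAGTTTTCCGG"))    # TTTTCCGG (primer at first base)
print(trim_primer("AGTGCCAGTTTTCCGG"))   # TTTTCCGG (stray leading base also removed)
print(trim_primer("TTTTCCGGAAAA"))       # None (no primer, read discarded)
```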
If you were to however just use the trim function instead of sequence based primer removing with cutadapt you would see:

Sequence 1, using trim instead of cutadapt:
F: X-[primer]-TTTTCCGG 3'
R: X-[primer]-AAAACCGG 3'
Using trim 6 (because we THINK the primer is 6 bases long) instead of cutadapt to remove [primer].
Result:    [r]TTTTCCGGAAAA[r]
r = the remaining base of [primer], because we trimmed 6 bases and didn't account for the extra X.

In this scenario our merged sequence would be called differently than with cutadapt.
So, based on this scenario your statement

“I thought that using cutadapt was a better method than using the trim function to remove primers”

would be true. However, I should note that this scenario should be easy to avoid if you have processed your reads carefully up to this step. The dada2 trim function is convenient when you know your sequences all start in the same position. If you’re unsure, cutadapt would certainly be more reassuring.

In single-end cases, the same logic applies, however you will need to make sure you truncate some sequences from the 3’ (as you should be doing anyways when removing poor quality tails). If you don’t truncate anything and you do have your primers starting at different positions on the 5’ then yes you can have multiple calls on the same biological feature.

Sequence 1 and Sequence 2 are biologically the same, however for some reason Sequence 1 has an extra nt at the beginning (X):
Sequence 1:
X-[primer]-TTTTCCGGAAAA 3'
Sequence 2:
[primer]-TTTTCCGGAAAAt 3'      # 1 extra t here because we didn't have X

cutadapt to remove [primer] :

Sequence 1:
TTTTCCGGAAAA 3'     length=12
Sequence 2:
TTTTCCGGAAAAt 3'    length=13

Therefore 2 distinct features will be called even though biologically these should be the same sequence. Sequence 2 just happens to have 1 extra nt.

Now, we take these sequences to DADA2, and you use the trunc function to cut from the 3’:

dada2, --p-trunc-len 10; which is to say, truncate everything after position 10:

Sequence 1:
TTTTCCGGAA 3'    length=10
Sequence 2:
TTTTCCGGAA 3'    length=10

Therefore, Sequence 1 == Sequence 2. yay!
So, the truncating was important here!
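The single-end case above, as a quick sketch (toy reads taken from the example; uppercase T stands in for the lowercase t):

```python
# Truncating both reads at the same position makes the length discrepancy
# introduced upstream (the extra 3' base) disappear.

def truncate(read, trunc_len):
    """Keep only the first trunc_len bases; DADA2 would discard reads
    shorter than trunc_len, so we return None for those."""
    return read[:trunc_len] if len(read) >= trunc_len else None

seq1 = "TTTTCCGGAAAA"    # Sequence 1 after cutadapt (12 nt)
seq2 = "TTTTCCGGAAAAT"   # Sequence 2 after cutadapt (13 nt)

print(seq1 == seq2)                              # False: two features
print(truncate(seq1, 10) == truncate(seq2, 10))  # True: one feature
```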

The newer methods rely on max-ee, the maximum expected errors allowed, instead of average quality scores to filter poor-quality reads. This has been shown to be more effective than traditional quality-score averaging; see Robert Edgar’s original paper on expected errors for more details. This is why the default --p-trunc-q in dada2 is 2, and I would recommend leaving it at that.
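For intuition, the expected number of errors is just the sum of per-base error probabilities p = 10^(-Q/10) over a read (the quality scores below are toy values, not real data):

```python
# Expected-error filtering: sum per-base Phred error probabilities and
# compare against a max-ee threshold (q2-dada2's --p-max-ee defaults to 2),
# rather than dropping reads at the first low-quality base.

def expected_errors(quals):
    """Sum of per-base error probabilities, p = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in quals)

good = [38] * 20               # uniformly high-quality read
dip  = [38] * 19 + [15]        # one low-quality base at the 3' end

print(round(expected_errors(good), 3))   # 0.003
print(round(expected_errors(dip), 3))    # 0.035
# A single Q15 base adds only ~0.03 expected errors, far below the default
# max-ee of 2, so the read survives without aggressive quality truncation.
```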

Hope this helps!


Thank you @Mehrbod_Estaki - excellent and detailed explanations - I think I’m starting to understand this better.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.