denoising on multiple runs to be combined downstream

Mehrbod_Estaki · April 18, 2020, 11:48pm

Hi @hsapers.
The ultimate goal here is that the reads you are comparing should cover the exact same region. Otherwise, even a single-nucleotide difference between them would lead to them being called different features.
For example:
In study 1: V4 region, trim length = 0

F: AACCGGTT

In study 2: same V4 region, same read, trim length =1

F: ACCGGTT

So even though these 2 reads are technically identical and should be called the same feature, trimming a single nucleotide means they will be called different features.
The same applies to truncating from the 3' end instead of trimming from the 5' end.

In study 1: V4 region, trunc length = 0

F: AACCGGTT

In study 2: same V4 region, same read, trunc length =1

F: AACCGGT

So this explains the first statement: when running denoise-single, trim-left and trunc-len need to be the same across runs.
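To make the single-end case concrete, here is a minimal Python sketch of the two toy examples above. The helper `trim_trunc` and its `trunc_by` argument are hypothetical names for illustration only; they follow this post's convention of counting the bases removed from the 3' end, whereas the real `trunc-len` option is the position at which reads are cut.

```python
def trim_trunc(read, trim_left=0, trunc_by=0):
    """Toy model: drop `trim_left` bases from the 5' end and
    `trunc_by` bases from the 3' end of a read."""
    end = len(read) - trunc_by
    return read[trim_left:end]

read = "AACCGGTT"  # the same biological read seen in both studies

study1 = trim_trunc(read)                     # no trimming or truncating
study2_trimmed = trim_trunc(read, trim_left=1)
study2_truncated = trim_trunc(read, trunc_by=1)

# Features are compared as exact strings, so any length change splits them.
print(study1, study2_trimmed, study1 == study2_trimmed)
print(study1, study2_truncated, study1 == study2_truncated)
```

Both comparisons come out as not equal, which is the reason the two parameters have to match for single-end data.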

Now let's consider a paired-end example:
2 different runs, same primers

Run 1: no trimming or truncating

F:      AAACCCGGGTTT
R:            CCCAAAGGGTTT
merged: AAACCCGGGTTCCCAAA

Run 2: trim-left-f=0, trim-left-r=0, trunc-len-f=1, trunc-len-r=1

F:      AAACCCGGGTT
R:             CCAAAGGGTTT
merged: AAACCCGGGTTCCCAAA

You can see that the merged feature is identical in both cases, because truncating within the overlap region changed neither the length of the merged read nor its sequence.
However, let's repeat Run 2, but now we'll also add trim-left-f=1

F:      AACCCGGGTT
R:             CCAAAGGGTTT
merged: AACCCGGGTTCCCAAA

Now we see that our merged sequence is actually 1 nt shorter, and thus not comparable to the merged sequence from Run 1. So, with paired-end reads, as long as enough overlap remains for the reads to merge, the truncation parameters can differ between runs, but NOT the trim lengths. This is why the tutorial is explicit that for 'denoise-single' both parameters need to be the same.
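The same logic can be checked with a small Python sketch. Everything in it is an assumption for illustration: the 18 nt amplicon is made up (it is not the sequence from the diagrams above), the reverse read is written already reverse-complemented into forward orientation, and `merge` is a naive exact-overlap joiner, not how DADA2 actually merges pairs.

```python
def merge(fwd, rev, min_overlap=3):
    """Join two reads at their longest exact suffix/prefix overlap.
    Returns None when fewer than `min_overlap` bases overlap."""
    for k in range(min(len(fwd), len(rev)), min_overlap - 1, -1):
        if fwd[-k:] == rev[:k]:
            return fwd + rev[k:]
    return None

AMPLICON = "AAACCCGGGTTTCCCAAA"  # hypothetical amplicon, 18 nt
FWD = AMPLICON[:12]              # forward read
REV = AMPLICON[6:]               # reverse read, shown in forward orientation

def denoise_toy(trim_left_f=0, trunc_by_f=0, trunc_by_r=0):
    """Apply toy trim/truncate settings, then merge the pair."""
    fwd = FWD[trim_left_f:len(FWD) - trunc_by_f]
    # In forward orientation, the 3' end of the reverse read is its left side.
    rev = REV[trunc_by_r:]
    return merge(fwd, rev)

run1 = denoise_toy()                                   # no trimming or truncating
run2 = denoise_toy(trunc_by_f=1, trunc_by_r=1)         # truncation only
run3 = denoise_toy(trim_left_f=1, trunc_by_f=1, trunc_by_r=1)

print(run1 == run2)  # truncating inside the overlap leaves the merged read intact
print(run1 == run3)  # trimming the 5' end shortens the merged read
print(len(run1), len(run3))
```

Truncation only eats into the overlap, so the merged read survives as long as the overlap does. Trimming removes bases from the outer end of the amplicon, which nothing on the other read can restore.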

In your case, you said you are not using the trim options because you are using cutadapt. That technically means your trim lengths are indeed the same: zero. So you are meeting that criterion.

As for your approach with trunc-q=18, I would advise against it. With denoising methods, filtering based on these quality scores is not really necessary anymore, which is why the default is set to 2. I would keep that default. Setting it to 18 would indeed cause a lot of length variation, and it is unnecessary here.
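To illustrate where that length variation comes from, here is a rough Python sketch of quality-based truncation. It assumes the rule as documented for DADA2's truncQ (cut the read at the first base whose quality score is less than or equal to the threshold). The function name and the quality scores are made up for illustration.

```python
def truncate_at_quality(seq, quals, trunc_q):
    """Cut `seq` at the first position whose quality is <= trunc_q."""
    for i, q in enumerate(quals):
        if q <= trunc_q:
            return seq[:i]
    return seq

seq = "AACCGGTT"
# Two copies of the same read with different (made-up) quality profiles.
quals_a = [38, 38, 37, 36, 30, 25, 17, 12]
quals_b = [38, 37, 36, 20, 16, 30, 28, 25]

for trunc_q in (2, 18):
    a = truncate_at_quality(seq, quals_a, trunc_q)
    b = truncate_at_quality(seq, quals_b, trunc_q)
    print(trunc_q, a, b, a == b)
```

With the threshold at 2 both copies stay full length and identical. At 18 they are cut at different positions, so two copies of the same sequence end up with different lengths.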

Hope this helps.