Deciding where to trim in Dada2 on paired end reads?

Kasia_S · March 5, 2021, 6:14pm

Hello!
I'm completely new to the world of QIIME2 (and python in general) so my apologies for this very naive series of questions.

I'm running qiime2 2020.11 in conda and processing some mouse microbiota samples. I've made it from importing my pair ends, processing them in dada2, assigning taxonomy and even producing bar plots (yay!). However, I've struggled identifying the correct trim/trunc combination to use for these data.

I've tried tried a few different combinations based on what I've read in forums here and consulting with other scientists. All of these have given me plausible but different community compositions.

Here's how my demux file looks:

Based on this, which would be "better":

--i-demultiplexed-seqs demux-paired-end.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 270
--p-trunc-len-r 230
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats denoising-stats.qza
--verbose

(results in approx 12,000 features)

or

--i-demultiplexed-seqs demux-paired-end.qza
--p-trim-left-f 13
--p-trim-left-r 13
--p-trunc-len-f 270
--p-trunc-len-r 230
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats denoising-stats.qza
--verbose

(results in approx. 6,000 features)

If I understand correctly, this huge difference in output is due to the fact that one run trims the first bps i.e. the primers. What are my best strategies to figure out whether to trim or not/which suggestion is best?

Also, I've read that I should check to make sure I have a good overlap between my paired-end reads. Is this available in one of the output files dada2 generates or do I need to run a separate command to check this?

Thank you in advance!

timanix · March 5, 2021, 7:41pm

Hi!
It is difficult to answer without knowing, what is a history of your data. For example, based on the amplified region you may learn expected length of joined reads, and calculate the length of forward and reverse reads that will be a compromise between keeping a good overlap and trimming as much of bad quality bases at the ends as possible.
Also, it is important to know if you already removed primers from reads with cutadapapt, or they are still in the reads. If they are, you may want to trim them with
–p-trim-left-f
option. In this case, you need additionaly adjust the length of –p-trunc-len-f and –p-trunc-len-r, so you still have at least 20 nt of overlap between reads.

For example, your expected amplicon size is 450 nt. So, you may try something like 270 and 210. If your primers are still in the forward reads, and you want to cut them from the reads by –p-trim-left-f, you need to compensate it by increasing –p-trunc-len-f and –p-trunc-len-r options to keep enough of overlapping nt.

Of course it is better to play with this numbers and choose the ones with best amount of output reads.

Kasia_S · March 5, 2021, 9:03pm

I'm not sure how much this information helps but:

this is V3 and V4 amplification on an IlluminaMiSeq
I downloaded the pair end reads (two reads per sample) from basespace. I assume this doesn't trim the primers?

Here is some additional information I got from the collaborator who suggested the 0-270 and 0-240 truncation:

So the primers which give an estimated ~464 bp amplicons size which on a 2x300 cycle run should have just about ~140 bp overlap.
Basically you want to remove as much of the poor quality reads as you can respecting that minimum of 20 bp overlap is required for proper merging of paired reads following denoising.
Based your quality plots you may truncate your Forward reads at the ~270 position and your Reverse reads at the ~230 position. Also, put in your mind that sometimes you should some sequences from the 3 5’ of the Forward reads to get rid of that initial dip that you can see in the forward reads, however, this is not required in this case.

I'm not sure if this is at all helpful but please let me know if you have any additional insight.

timanix · March 5, 2021, 9:39pm

This looks reasonable and good to me, and if you received 12000 reads per sample as an output, that's good enough to proceed. I would double check if primers are still in the sequence. Usually when working with MiSeq, you are receiving already demultiplexed reads without barcodes, but with adapters still remained. So I would remove first 15/20 nt in forward reads (if primers/adapters in the reads) and compensate it by increasing "trunc" length for forward and/or reverse reads to keep overlapping region.
You can search for primers in the reads to check if they are still in the reads.

Kasia_S · March 5, 2021, 10:37pm

This makes a lot of sense since my MiSeq data was indeed demultiplexed. Are the adapters only in the forward reads or should I remove the first ~15-20 from both forward and reverse?
Thanks so so much for your help!

timanix · March 5, 2021, 10:46pm

Reverse primers will be trimmed anyway by trunc-r parameter, so no need to worry about them on this step

system · April 6, 2021, 4:46am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.