“Atacama soil microbiome” tutorial DADA2 - trimming and truncate location

Hello,

I'm running the “Atacama soil microbiome” tutorial and have a question about where to trim and truncate when using DADA2.

I was trying to figure out why you chose positions 13 and 150 here:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

I used the command below to generate demux.qzv, as I wanted to see the quality plots. I got demux.qza from the link provided on the tutorial page, right before the commands above.

qiime demux summarize --i-data demux.qza --o-visualization demux.qzv

Next, I looked at the quality plots from demux.qzv file. Here is a screenshot.


If I choose the first sequence base that has a quality score lower than 30, then it would be 142 for forward reads, and 4 for reverse reads.

How would you determine that 13 and 150 are the appropriate positions at which to trim and truncate?

Thanks!

Hello @Sihan_Bu,

The tutorial states:

"[...] but no trimming is being applied to the ends of the sequences to avoid reducing the read length by too much."

Truncating the reverse reads at position 4 wouldn't make any sense anyway because they are already trimmed to position 13.

As for why they chose to trim to position 13: trimming a small number of bases from the beginning of reads is common with Illumina sequences, because they tend to exhibit a dip in quality there, as seen in these figures.

Hi,

Thank you for your reply.

  1. I'm new to QIIME 2. Could you elaborate on this statement: "the first thirteen bases of the forward and reverse reads are being trimmed, but no trimming is being applied to the ends of the sequences to avoid reducing the read length by too much"?
  • Neither the --p-trim- nor the --p-trunc- parameters are doing anything at the end of the sequences.
  2. For trimming, is it a good choice to always choose position 13 for both trim_left_r and trim_left_f?
  • I saw the video "Denoising sequence data with DADA2" from QIIME 2, which uses position 1 for trim_left_r and position 0 for trim_left_f. I have no idea how these positions were chosen. It's also confusing to me how to choose the position for trimming.
  3. For truncating, in this screenshot, the forward reads plot shows that position 142 has a quality score < 30. However, the reverse reads plot shows that position 4 has a quality score < 30.
  • Your statement makes sense that they are already trimmed to position 13, so position 4 is meaningless as a truncation point. So, if the truncation positions are very different based on the quality plots, as in this case, should I just choose the larger one? In this case, I will use position 150, not 4.
  • Another question: why did they choose position 150 and not 142?

Thank you for your help!

Hello @Sihan_Bu,

Could you elaborate on this statement "the first thirteen bases of the forward and reverse reads are being trimmed, but no trimming is being applied to the ends of the sequences to avoid reducing the read length by too much"?

In this scenario we want the paired-end reads to merge; otherwise DADA2 discards them. For them to merge, there needs to be a certain amount of overlap. Trimming more bases from the reads makes them less likely to overlap.
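
To make the overlap argument concrete, here is a rough sketch of the arithmetic in Python. The amplicon length of 253 bases and the minimum overlap of 12 bases are assumptions chosen for illustration (check the values that apply to your own primers and DADA2 version), and `merged_overlap` is a hypothetical helper, not part of QIIME 2 or DADA2.

```python
# Assumed values for illustration only; neither comes from the tutorial text.
AMPLICON_LEN = 253   # hypothetical amplicon length in bases
MIN_OVERLAP = 12     # assumed minimum overlap required for merging


def merged_overlap(amplicon_len, trim_left_f, trunc_len_f, trim_left_r, trunc_len_r):
    """Estimate how many bases the forward and reverse reads share.

    The forward read is modelled as covering amplicon positions
    [trim_left_f, trunc_len_f) counted from the 5' end. The reverse read
    covers the mirror-image interval counted from the other end.
    """
    fwd_start = trim_left_f
    fwd_end = min(trunc_len_f, amplicon_len)
    rev_start = max(amplicon_len - trunc_len_r, 0)
    rev_end = amplicon_len - trim_left_r
    return max(0, min(fwd_end, rev_end) - max(fwd_start, rev_start))


# Parameters used in the tutorial command.
overlap = merged_overlap(AMPLICON_LEN, 13, 150, 13, 150)
print(overlap, overlap >= MIN_OVERLAP)

# Truncating the reverse reads much harder removes the overlap entirely.
overlap = merged_overlap(AMPLICON_LEN, 13, 150, 13, 100)
print(overlap, overlap >= MIN_OVERLAP)
```

Under these assumed numbers, the tutorial settings leave a comfortable overlap, while truncating one read direction at 100 would leave none, so the pairs could not merge.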

Neither the --p-trim- nor the --p-trunc- parameters are doing anything at the end of the sequences.

Yes, that's correct in this situation.

For trimming, is it a good choice to always choose position 13 for both trim_left_r and trim_left_f?

No, this value is always chosen according to the sequences at hand.

I have no idea how these positions were chosen. It's also confusing to me how to choose the position for trimming.

There is no set-in-stone rule. Many factors come into play, such as the tolerance for errors, the amount of available overlap, etc.

For truncating, in this screenshot, the forward reads plot shows that position 142 has a quality score < 30. However, the reverse reads plot shows that position 4 has a quality score < 30.

Again, the idea that you have to trim or truncate wherever the first quartile (I'm assuming this is what you're referring to) goes below 30 is just one approach. I've seen 20 used as this cutoff more commonly.
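
As an illustration of how much the chosen cutoff matters, the sketch below applies the "first position where the first quartile drops below a threshold" rule to a made-up list of per-position first-quartile scores. The scores are invented for the example (they are not the Atacama data), `first_position_below` is a hypothetical helper, and positions are counted from 1 here.

```python
def first_position_below(q25_scores, threshold, skip=0):
    """Return the first 1-based position whose score is below threshold.

    Positions up to and including `skip` are ignored, which is useful when
    the start of the read is going to be trimmed anyway.
    Returns None if no position falls below the threshold.
    """
    for position, score in enumerate(q25_scores, start=1):
        if position > skip and score < threshold:
            return position
    return None


# Invented first-quartile quality scores: an early dip, then a late decline.
q25 = [34, 33, 31, 28, 32, 34, 34, 33, 30, 27, 19]

print(first_position_below(q25, threshold=30))          # early dip
print(first_position_below(q25, threshold=30, skip=4))  # ignoring the dip
print(first_position_below(q25, threshold=20))          # looser cutoff
```

With this toy data, a cutoff of 30 flags the early dip, skipping the trimmed region moves the answer to the late decline, and a cutoff of 20 moves it later still. The rule is a starting point and has to be weighed against the overlap the reads need to merge.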

Your statement makes sense that they are already trimmed to position 13 so position 4 is meaningless to truncate. So, if the truncated positions are very different like in this case based on the quality plots, should I just choose the larger one? In this case, I will use position 150 not 4.

The truncation positions between read directions are not very different; they're both 150. Truncating reads to position 4 would mean you're discarding essentially the entirety of that read. This will never be desirable.
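
A simplified model of how the two parameters combine on a single read may help here. This is only a sketch based on the documented behaviour of the parameters (reads shorter than the truncation length are discarded, and a truncation length of 0 means no truncation); it is not DADA2's actual implementation, and `trim_and_truncate` is a hypothetical name.

```python
def trim_and_truncate(read, trim_left, trunc_len):
    """Apply a truncation length and a left trim to one read.

    Returns the retained bases, or None if the read is discarded for
    being shorter than trunc_len. A trunc_len of 0 disables truncation.
    """
    if trunc_len > 0:
        if len(read) < trunc_len:
            return None              # too short: the read is dropped
        read = read[:trunc_len]      # keep positions 1..trunc_len
    return read[trim_left:]          # drop the first trim_left bases


read = "A" * 151  # placeholder read, 151 bases long

# Tutorial settings: 150 - 13 bases survive.
print(len(trim_and_truncate(read, trim_left=13, trunc_len=150)))

# Truncating at position 4 while trimming 13 leaves nothing at all.
print(len(trim_and_truncate(read, trim_left=13, trunc_len=4)))
```

In this model the retained length is simply the truncation length minus the left trim, which is why a truncation position smaller than the trim position cannot leave any usable sequence.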

Another question is why they chose position 150 not 142?

See my previous post.

Thank you, Colin!! This is really helpful.

