DADA2: loss too many reads after filtering

Hello all!

Setup:

I have 16S V3-V4 amplicon sequences using 341F–805R primers. Samples are extracted from soil. The raw data have barcodes and primers, so I use cutadapt to remove it.

Here is the data after barcodes and primers removal:

trimmed-seqs.qzv (321.6 KB)

Problem:
When I run DADA2, I lose too many reads after filtering and left less than 20% after chimera removal.

Command:

qiime dada2 denoise-paired \
--i-demultiplexed-seqs outputs/trimmed-seqs.qza \
--p-trunc-len-f 244 \
--p-trunc-len-r 244 \
--p-n-threads 0 \
--o-table outputs/table.qza \
--o-representative-sequences outputs/rep-seqs.qza \
--o-denoising-stats outputs/denoising-stats.qza

denoising-stats.qzv (1.2 MB)

I can see I have pretty good read quality both the forward and the reverse, so I didn't trunc too much, also to make sure it can overlap. But the biggest problem is I loss too many reads in the filtering step. I searched some forum posts, but their problem like low reads quality or strict –p-trunc-q value doesn't suit my questions.

I have no idea what causes the problem, any help will be much appreciated!

1 Like

Hello!
Could you try to set those parameters to lower values?
As you can see from seqs stats, 98% of subsampled reads are shorter than 244. All these reads are discarded in your case. Setting it to a lower value will allow the reads to pass the filter after truncation. Or you can provide a 0 to disable truncation.

1 Like

Hi! Thanks for your reply! I disabled truncation, and here is what I got:

Above 80% reads passed filter which was clearly improved compare to previous, but the non-chimeric is lower than 50%, is that normal?

Thanks for your help!

Hi!
I would try to run it again but this time with several sets of lower truncation values to compare results with one you got and decide based on it how to proceed. As you can see, you are also loosing reads on merging step and truncation of bad quality ends of reads may improve it.

1 Like

Hi, timanix!

Thanks for your suggestions! I have two questions now after I read through some other posts.

  1. So based on the below post, it explains --p-trunc-len discard sequence greater than the value it gives, which in my case, it discard sequence longer than 244 instead of discard sequence shorter than 244? If I understand correctly? I got confused by this post and your reply, which is opposite answer to me.

You're exactly correct, good catch! I'll make sure we get that documentation updated.

  1. I actually run it again with lower truncation values, like

--p-trunc-len-f 230
--p-trunc-len-r 240 \

and

--p-trunc-len-f 240
--p-trunc-len-r 240 \

All get worse results, and with the lower value I give to the forward reads, I only got less than 10% reads shown as non-chimeric.

In case you also need to check the stats file, I will upload it here.
denoising-stats_230_240.qzv (1.2 MB) denoising-stats_240_240.qzv (1.2 MB) denoising-stats_no_truncation.qzv (1.2 MB)

What other parameters I can set to get rid of low-quality reads? Also, what causes the worse results after I lower truncation values since it should be better after I truncate the bad quality reads?

Again, really appreciate your help!

Nope, it will discard reads shorter than given value and truncate longer reads to this value

  1. Could you try to run it with 230 for both forward and reverse reads? If you worry about overlap region in the latest version of Qiime2 you can specify it to a lower value than 12.

You lost a lot of reads when you applied 240/240 since a lot of your forward reads are shorter than 240 and where discarded.
230/240 looks similar to the stats with no trimming.
I will suggest to try 230/230 to see if it will improve stats and choose one with best outputs.

Hi timanix,

I checked other forum post, if I run it with 230 for both sides, I think I will have 4bp gap, right?

(forward read) + (reverse read) - (length of amplicon) - = overlap
230+230-464 = -4 bp,
then my reads cannot merge at all because they don't overlap.

Or, should I discard truncation and lower the value of overlap region which may save the reads loosing on merging step?

1 Like

In that case, probably it is an optimal solution! Decreasing minimum overlap with disabled truncation may save some reads :clap:.

Hi! Sorry for bothering you again!

I tried to use the minimum overlap parameter, I set --p-min-overlap 4, actually the result doesn't change too much. It is basically the same as the no truncation result. Seems I can't do anything else to improve my results? Do you have any suggestions? Thanks!

Hi!
I am running out of ideas :thinking:

Based on your results, you are still loosing reads on merging step after filtering. From one side, you have enough reads to proceed, from another - you results may be biased towards bacteria with shorter amplicons.
So option 1 will be just proceed with disabled truncation by Dada2 with a risk of biased data.
Option 2 is to use only forward reads. This way you will keep most of the reads but taxonomy annotations will be performed with shorter reads.
Option 3 is to try to merge your reads with vsearch plugin, where you can not only decrease overlapping region but also allow some mismatches and then denoise by Deblur to see if it will improve the output.

1 Like

Hi @Xinming_Xu!
Got a hint that you also may take a look on Figaro tool (not in Qiime2) that may help determine best truncation parameters.

Hi, really appreciate for all your suggestion! I've seen people using figaro, so first I give a shot to this method.

So, the recommended forward truncation position is 238 and the recommended reverse truncation position is 249.

[
    {
        "trimPosition": [
            238,
            249
        ],
        "maxExpectedError": [
            2,
            3
        ],
        "readRetentionPercent": 81.36,
        "score": 76.36009224060184
    },

Then I run it on dada2, also I indicate maxExpectedError to 2 and 3, here is what I got

No reads passed the filter. trunc_len_f (238) or trunc_len_r (249) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.

Then I realize, maybe it's because I forgot to set the primer length in figaro as I'm using the original data without trimming. But it still recommend me the same truncation position.

Figaro seems like a great tool to know the truncation value, but I got confused about the result I got when running dada2.

Hi @Xinming_Xu
So I will suggest to proceed with option 2 or try option 3 to see if it will improve amount of reads.

2 Likes

Hi, I've had a similar problem with some soil DNA. What I found worked best was to use only FWD reads and adjust the p=min to 4. Then I was able to keep more than 70% of reads after DADA2, whereas before adjusting I was only keeping between 30-40%. Interestingly, this is only happening for certain soil samples, other soil samples I collected in a different location are working much better. Just out of curiosity, what kind of soil/DNA extraction method are you using?

1 Like

Hi,

Thanks for your suggestion, I used V3V4, 341F-804R region, and the forward/reverse reads are around 245bp. So it's because the primer sets using or it's soil DNA caused only FWD reads are the best case you've got? I used powersoil kit (Qiagen) to extract my DNA.

I have used both the PowerSoil and a phenol:chloroform extraction method - the PowerSoil does seem to give me better quality DNA, especially after I add some powdered skim milk to the bead-beating step. I'm sequencing reads on a MiSeq v3 600, also with the 341F-804R primer pairs. The extractions I performed with the PowerSoil kit don't need the p-min drop to 4 to keep over 70% of reads after DADA2. But I do just use the FWD reads, the REV reads I usually get are just not good quality. :confused: