Low merged percentage when running DADA2

Hello,
I am working with sequences generated on an Illumina MiSeq with v3 chemistry (2x300 bp), using primers targeting the V3/V4 region: 341F (CCTACGGGNGGCWGCAG) and 805R (GACTACHVGGGTATCTAATCC). However, when I denoise with DADA2 in QIIME 2 (version 2022.11.1), only about 30-40% of my sequences end up merged. Is this acceptable, or are there ways to improve it?

As far as I can tell there should be plenty of overlap for merging, as DADA2 requires a minimum overlap of only ~12 bp and the expected amplicon size with these primers is ~464 bp.
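As a rough sanity check (assuming the 464 bp figure includes both primers, which are 17 bp and 21 bp long):

insert length after cutadapt ≈ 464 - 17 - 21 = 426 bp
overlap at 273/210 truncation ≈ 273 + 210 - 426 = 57 bp

which should be comfortably above the ~12 bp minimum.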

Removing the primer sequences with cutadapt:

qiime cutadapt trim-paired --i-demultiplexed-sequences demux-paired-end-V3V4-2.qza --p-front-f CCTACGGGNGGCWGCAG --p-front-r GACTACHVGGGTATCTAATCC --o-trimmed-sequences trim-demux-paired-end-V3V4-2.qza

Summarizing the information about the sequences after primer removal:

qiime demux summarize --i-data trim-demux-paired-end-V3V4-2.qza --o-visualization post-cutadapt-demux-V3V4-2.qzv

The output from this summary is in the attached post-cutadapt-demux-V3V4-2.qzv (linked at the bottom of this post).

So, I tried some different levels of truncation during DADA2. This was the command used:

qiime dada2 denoise-paired --i-demultiplexed-seqs trim-demux-paired-end-V3V4-2.qza --p-trunc-len-f 273 --p-trunc-len-r 210 --o-representative-sequences rep-seqs-dada2-273-210-v3v4-2.qza --o-table table-dada2-273-210-v3v4-2.qza --o-denoising-stats stats-dada2-273-210-v3v4-2.qza --p-min-fold-parent-over-abundance 2 --verbose --p-n-threads 8
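To view the resulting denoising stats, the stats artifact can be tabulated into a .qzv, roughly along these lines:

qiime metadata tabulate --m-input-file stats-dada2-273-210-v3v4-2.qza --o-visualization stats-dada2-273-210-v3v4-2.qzv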

And here is a summary of what that looks like at varying truncation levels:

| trunc (fwd/rev) | % passed filter | % merged | % non-chimeric | median % non-chimeric |
|---|---|---|---|---|
| 273/210 | 64.67 - 75.83 | 34.16 - 46.02 | 33.10 - 41.83 | 38.11 |
| 273/216 | 63.47 - 74.74 | 33.86 - 45.65 | 32.82 - 41.54 | 37.59 |
| 274/231 | 58.07 - 71.47 | 31.93 - 43.61 | 31.15 - 40.02 | 35.54 |
| 280/200 | 62.95 - 75.74 | 33.49 - 44.27 | 32.54 - 40.97 | 37.40 |
| 280/250 | 44.51 - 62.87 | 24.85 - 37.22 | 24.46 - 35.06 | 28.88 |

It just seems that I lose about half of the sequences that pass the filter during the merging step. I believe there is more than enough overlap for merging, so that shouldn't be the issue, and I've removed the primers and there are no adapters on my sequences. Any thoughts about what is happening here or how to improve the outcome?

And here are the files I've generated, just in case they are helpful. I've included the output from DADA2 with the trunc values of 273/210:
post-cutadapt-demux-V3V4-2.qzv (322.1 KB)
rep-seqs-dada2-273-210-v3v4-2.qzv (2.0 MB)
stats-dada2-273-210-v3v4-2.qzv (1.2 MB)

Hello @jazzy1,

Your truncation positions might be a bit too deep into the reads. Try pulling them back as far as the overlap will allow, for example 260/220.
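That would just mean swapping the truncation values into your existing command (the output names below are only placeholders):

qiime dada2 denoise-paired --i-demultiplexed-seqs trim-demux-paired-end-V3V4-2.qza --p-trunc-len-f 260 --p-trunc-len-r 220 --p-min-fold-parent-over-abundance 2 --p-n-threads 8 --o-representative-sequences rep-seqs-dada2-260-220-v3v4-2.qza --o-table table-dada2-260-220-v3v4-2.qza --o-denoising-stats stats-dada2-260-220-v3v4-2.qza --verbose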

Thanks for the suggestion, @colinvwood!

I'm not sure I understand what you mean, as I had already tried other truncation positions, such as 280/250. Nonetheless, I gave 260/220 a shot, but got much the same outcome, with only a slight improvement: 66.08 to 76.25% of reads passed the filter, and then only 35.68 to 47.69% were merged.

Hello @jazzy1,

This might be one of those situations where the quality scores at the ends of the reads, where the overlap has to occur, are low enough that merging doesn't work well. You can usually address this by truncating earlier in the read, where quality scores are higher. However, around forward ~265 and reverse ~225, which is already pushing it in terms of having enough overlap given insert length variation, the quality scores are already fairly low.

You could try lowering the --p-min-overlap parameter. This will raise the number of merged sequences but the drawback is you'll have lower confidence that merged reads actually belong together.

You could also try raising the --p-max-ee-f and --p-max-ee-r parameters, though I don't know off the top of my head if these only come into play with the filtering step or also the merging step. The drawback is that you're letting through more sequences with more possibly erroneous bases.
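As a rough sketch of what combining those two tweaks could look like (the specific values, min-overlap 8 and max-ee 3/5, are only illustrative; if I recall, the defaults are 12 for min-overlap and 2 for max-ee):

qiime dada2 denoise-paired --i-demultiplexed-seqs trim-demux-paired-end-V3V4-2.qza --p-trunc-len-f 260 --p-trunc-len-r 220 --p-min-overlap 8 --p-max-ee-f 3 --p-max-ee-r 5 --p-min-fold-parent-over-abundance 2 --p-n-threads 8 --o-representative-sequences rep-seqs-relaxed-v3v4-2.qza --o-table table-relaxed-v3v4-2.qza --o-denoising-stats stats-relaxed-v3v4-2.qza --verbose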

Thanks again, @colinvwood!

I had played around with the --p-min-overlap parameter in a different dataset I was working on. I was curious whether changing that number would lower the confidence in the results, since I assumed the default was chosen for a reason?

I hadn't tried using only the forward reads because it felt like a shame to throw away half of the data that was generated, but I had been reading about that as a potential solution. So, I gave this a go, and it did improve the results quite a bit. Using only the forward reads and a --p-trunc-len of 260, the percentage of input passing filter ranged from 74.05 to 83.24%, and the percentage of non-chimeric reads ranged from 60.51 to 69.08%. I'm thinking I will proceed with these outputs, as the percentage of reads making it through DADA2 is quite a bit higher.
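In case it's useful, the forward-only run was done with denoise-single, which (as far as I understand) just uses the forward reads when given a paired-end artifact; the command looked roughly like this (output names are placeholders):

qiime dada2 denoise-single --i-demultiplexed-seqs trim-demux-paired-end-V3V4-2.qza --p-trunc-len 260 --p-n-threads 8 --o-representative-sequences rep-seqs-dada2-single-260-v3v4-2.qza --o-table table-dada2-single-260-v3v4-2.qza --o-denoising-stats stats-dada2-single-260-v3v4-2.qza --verbose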

Hello @jazzy1,

If you have the bandwidth, you can always do both (a single-end analysis and a paired-end analysis).

Continuing with just the forward reads might seem like a no-brainer because of the higher percentage of reads passing, but each of these reads carries less information. Say you have 1000 reads of both forward and reverse, each 250 bp. Keeping 65% of the forward reads gives you 650 reads, with (650 * 250) = ~162k bases of information. Keeping 40% of the reads after merging gives you 400 merged reads, with (400 * (250 + 250 - 12)) = ~195k bases of information. So using forward reads gives you more things to compare, but each comparison is less informed, while using merged reads gives you fewer things to compare, but each comparison is more informed.
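(Spelled out as shell arithmetic, just to double-check the numbers:)

echo $(( 650 * 250 ))                # forward-only: 162500 retained bases
echo $(( 400 * (250 + 250 - 12) ))   # merged: 195200 retained bases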

Using only the forward reads here is probably pretty comparable to a V3-only analysis, because the first read extends 260 bp from the forward primer into (though granted, probably a bit beyond) the V3 region.

Thanks, @colinvwood.

Do you mean doing both single end and paired end analyses and comparing the outputs? Or do you mean completing both and incorporating both together?

Yes, that was exactly my thinking, and why I was hesitant to use only the forward reads! You summarized it very eloquently, which is very helpful.

Hello @jazzy1,

Comparing the results, or incorporating the (separate) interpretations together. It won't be possible to combine the single-end and paired-end sequences into one analysis in QIIME 2.

That's what I thought - that makes sense! Thank you for some helpful input, @colinvwood.
