Losing reads during merging even though I have enough length for overlap

Hey everyone,

I'm teaching myself how to process 16S microbiome data. I'm currently working with 13 samples of paired-end 2x250 16S data, and I'm losing ~40% of my reads in the merge step of DADA2 even though the truncation lengths I set for my FWD and REV reads should leave enough overlap.

To start, I trimmed my reads with cutadapt to remove the primers (515F-907R, V4-V5). Here is the visualization of the trimmed reads (a sketch of the command I ran is below the attachment).
trimmed.qzv (317.4 KB)
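For reference, the trimming command looked roughly like this (a sketch; the artifact names are placeholders, and the primer sequences shown are commonly cited 515F/907R variants with their degenerate bases, which may differ from the exact primers reported in the paper):

    # Remove the 515F/907R primers from the 5' end of the forward and
    # reverse reads, respectively. Substitute the exact primer sequences
    # reported in the source paper if they differ.
    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences demux.qza \
      --p-front-f GTGYCAGCMGCCGCGGTAA \
      --p-front-r CCGTCAATTCMTTTRAGTTT \
      --o-trimmed-sequences trimmed.qza \
      --verbose

    # Summarize the trimmed reads to produce the quality/length visualization.
    qiime demux summarize \
      --i-data trimmed.qza \
      --o-visualization trimmed.qzv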

From there I did a lot of parameter testing with qiime dada2 denoise-paired to see which settings gave me the best read retention. In every test, the step that loses the largest number of sequences is merging.

219-186.qzv (1.2 MB)
219-186-5.qzv (1.2 MB)
225-186.qzv (1.2 MB)
219-210.qzv (1.2 MB)
219-210-4-4-4.qzv (1.2 MB)
219-210-4-6-4.qzv (1.2 MB)

Above are the results of all the tests I ran. I think truncation lengths of 219F + 210R give me the best results. The other parameters that seem to help are trim_left_f = 5, max_ee_f = 4, max_ee_r = 6, and min_overlap = 4 (an example of the full command is below). Even with well over 12 bp of overlap at these truncation lengths, I'm still losing ~40% of my reads at the merging step and I can't figure out why. I don't know which other parameters to adjust, whether I just need to change my truncation lengths, or whether keeping roughly 30% of my reads at the end is acceptable and I should move on with my analysis.
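The best-performing run so far (the 219-210-4-6-4 test) was invoked roughly like this (a sketch; artifact names are placeholders):

    # DADA2 denoising corresponding to the 219-210-4-6-4 test above.
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs trimmed.qza \
      --p-trim-left-f 5 \
      --p-trunc-len-f 219 \
      --p-trunc-len-r 210 \
      --p-max-ee-f 4 \
      --p-max-ee-r 6 \
      --p-min-overlap 4 \
      --p-n-threads 0 \
      --o-table table-219-210.qza \
      --o-representative-sequences rep-seqs-219-210.qza \
      --o-denoising-stats stats-219-210.qza

    # Tabulate the per-sample stats (input, filtered, denoised, merged,
    # non-chimeric) to see exactly where reads are lost.
    qiime metadata tabulate \
      --m-input-file stats-219-210.qza \
      --o-visualization stats-219-210.qzv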

The data I'm analyzing is from a research paper I found online; I'm using it to practice 16S analysis.

Hello @johnbiggs,

Thank you for the well-written post and labeled visualizations!

A couple of observations re truncation lengths and parameters:

  • Your amplicon size is 907 - 515 = 392 nucleotides (with variation).
  • With truncated read lengths of 219 and 186, and a minimum overlap of 12 bp: 219 + 186 - 12 = 393 nucleotides, only 1 nt above your average amplicon length. It's likely that you're losing amplicons that are longer than 392 nt due to biological variation.
  • With truncated read lengths of 219 and 210, and a minimum overlap of 12 bp: 219 + 210 - 12 = 417 nucleotides, 25 nt above your "average" amplicon length. This is much safer, but it reaches deep into the low-quality range of the reverse reads. It might also still exclude species with the longest V4-V5 regions; you'd have to do some research on length variation.
  • I would refrain from altering max_ee_f, max_ee_r, and min_overlap, and especially from lowering min_overlap. For example, setting min_overlap to 4 as you have means any pair of reads has roughly a 1 in 4^4 = 1 in 256 chance (higher once you add the chances of longer overlaps) of merging by random chance. Even ignoring sequence homology, you can probably agree that is too high.
  • Looking closely at your quality graphs, I don't think a forward truncation length of 219 is ideal. If you zoom in on the region around the first drop-off in quality, around positions 220 - 230, you can see that position 222 is the last position before the drop-off. You could safely change 219 to 222 here. One could even make an argument for ignoring this drop-off entirely, since the quality recovers.
  • You seem to be more tolerant of lower quality in the reverse reads than in the forward reads: the first trimmed position in the forward reads has a first-quartile score of 23, while the reverse has a score of 19. Whatever cutoff you deem acceptable, I would try to apply it consistently to both read directions.

All this being said, I agree that it's a little strange that the merging percentage isn't higher. I think the combination of library prep decisions and the quality profile of the reads is falling a bit short of being able to merge everything.

One weird thing going on is your forward read length distribution. You can see that the median length is 225 (indicating primer trimming), but the 75th percentile is 251. It seems like anywhere between a quarter and half of your forward reads did not get their primer trimmed. I would investigate this further; perhaps you didn't properly include the degenerate bases for one of the primers? This could well be why you see so much loss to the chimera filter. I wouldn't expect this to affect merging, but who knows. One quick way to check is sketched below.
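A sketch of that check, re-running the primer trimming with untrimmed reads discarded (assuming placeholder artifact names; substitute the exact primer sequences, including all degenerate bases, that the paper reports):

    # Drop any read pair in which the expected primer is not found. The
    # difference in read counts relative to your original trimming run tells
    # you what fraction of reads actually contained a recognizable primer.
    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences demux.qza \
      --p-front-f GTGYCAGCMGCCGCGGTAA \
      --p-front-r CCGTCAATTCMTTTRAGTTT \
      --p-discard-untrimmed \
      --o-trimmed-sequences trimmed-discard-untrimmed.qza \
      --verbose

    qiime demux summarize \
      --i-data trimmed-discard-untrimmed.qza \
      --o-visualization trimmed-discard-untrimmed.qzv

If roughly a quarter to half of the reads disappear, that confirms the primer isn't being matched in those reads, which usually points to a degenerate-base or primer-sequence mismatch.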


Thank you for your reply @colinvwood,

Taking your advice, I made some changes to try to get better read retention.

  • For the FWD read length distribution, I don't know why there is a discrepancy between the median length and the 75th percentile. I re-ran my cutadapt command, making sure I had placed the correct nucleotides in the right parameters, and it didn't change the discrepancy. I don't think using the --p-anywhere options would change the outcome either, since the paper doesn't mention using spacers for their PCR amplification (I still don't know when you would use --p-anywhere rather than --p-adapter and --p-front).

  • I really don't know why this discrepancy exists in the first place, but I think I'll chalk it up to the realities of working with real data and simply note it in my analysis.

  • I decided on a quality cutoff score of 20 for my reads so that the first-quartile scores at my truncation lengths are consistent between read directions.

  • With the cutoff score of 20, I don't think I can really go past 186 for the REV reads, since beyond that point the first-quartile scores fall below 20. So I kept the REV truncation length at 186, and for the FWD reads I chose 222 as the truncation length (the command for this test is sketched after this list). I know that 222 + 186 - 12 = 396, which is only 4 nt above the average amplicon length, but I hope it will increase the number of reads kept during the merging step.

  • I also plan on testing higher truncation lengths for the FWD reads, since their dip in quality scores is smaller; I think the minor dip below a Q score of 20 in the FWD reads is acceptable compared to the larger dip below Q20 in the REV reads.
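The new test looks roughly like this (a sketch; artifact names are placeholders, and everything besides the truncation lengths is left at the DADA2 defaults):

    # New test: FWD truncated at 222 (last position before the quality
    # drop-off), REV kept at 186 (last position with a first-quartile
    # score of at least 20).
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs trimmed.qza \
      --p-trunc-len-f 222 \
      --p-trunc-len-r 186 \
      --p-n-threads 0 \
      --o-table table-222-186.qza \
      --o-representative-sequences rep-seqs-222-186.qza \
      --o-denoising-stats stats-222-186.qza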

When these tests finish I'll update this post with the results. I hope these changes will help more reads survive the merging step.

After testing, the 222 FWD truncation length only slightly increased the read count compared to the original 219-186 test. I really don't know what else might be preventing reads from merging, and if I increase the FWD truncation length above 224-225 I end up losing almost all of the reads after DADA2 finishes.

235-186.qzv (1.2 MB)
222-186.qzv (1.2 MB) (This one is 229-186)
222-186.qzv (1.2 MB)

Hello @johnbiggs,

As you noted, a setup that can only merge amplicons up to 4 nt longer than your average amplicon length is probably not a good idea. There is certainly enough biological variation for a significant fraction of amplicons to be lost.

One peculiarity of how DADA2 works is that all reads shorter than the given truncation length are discarded. This is why you see such bad results for the forward truncation length of 235. From looking at the read length distribution, you don't want to go above 224 for the forward truncation, or above 230 for the reverse truncation.

From the experimental runs you linked in the first post, you were getting the best results around 219F-210R. I would zoom in on this region to find the optimal parameters; a sketch of how you could sweep it is below. Of course, you can now raise 219 to 222.
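If you want to explore that region systematically, a simple loop over truncation-length combinations is one way to do it (a sketch; artifact names are placeholders, and each combination is a full DADA2 run, so this takes a while):

    # Sweep forward/reverse truncation lengths around the best-performing
    # region and collect the denoising stats for each combination.
    for f in 219 222 224; do
      for r in 200 205 210; do
        qiime dada2 denoise-paired \
          --i-demultiplexed-seqs trimmed.qza \
          --p-trunc-len-f $f \
          --p-trunc-len-r $r \
          --p-n-threads 0 \
          --o-table table-${f}-${r}.qza \
          --o-representative-sequences rep-seqs-${f}-${r}.qza \
          --o-denoising-stats stats-${f}-${r}.qza

        qiime metadata tabulate \
          --m-input-file stats-${f}-${r}.qza \
          --o-visualization stats-${f}-${r}.qzv
      done
    done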

Other than the chimera issue, I think you've done everything you can. As mentioned, this is mostly a consequence of the library-prep decisions combined with a little bad luck with read quality.

Let me know if you have any other questions!


@colinvwood

Thank you for all the help you gave me! I'll look into the 219F-210R region, trying FWD truncation lengths of 222 and 224.

I just have one more question: for the other parameters (max_ee and min_overlap), would you recommend not changing them? I saw in your original post that you recommend not messing with them; is it just that they add too much error into downstream analysis?

Hello @johnbiggs,

I recommended against changing those parameters because their defaults have been tested and benchmarked. Raising max_ee_f and max_ee_r or lowering min_overlap is going to give you more junk/bad data. At the same time, they are parameters for a reason, so if you have a good justification for altering them and an understanding of what they do, then you can.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.