Looking for advice - sequence loss in DADA2 merge

toobiwankenobi · January 3, 2019, 4:45pm

I'm facing a problem similar to the one @cintia_martins had. I also end up with a low number of merged reads even though my quality scores look quite good (see screenshots).

I chose my truncation lengths by looking at the quality score distributions for both the forward and reverse reads. --p-trim-left-f and --p-trim-left-r were used to trim the primers. Here is the command I ran:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux_paired_end_bacteria.gza.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 251 \
  --p-trunc-len-r 234 \
  --o-table table.qza \
  --p-n-threads 0 \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

Do you have any idea what I could change to increase the number of merged reads?

Thanks for your help!

ChrisKeefe · January 4, 2019, 5:19am

Welcome aboard, @toobiwankenobi !
Thanks so much for including screencaps and complete commands. This helps a ton. I’ll do my best to help out, but some housekeeping first: because this post was split from another post, I’m not sure whether you wrote the title. If so, please be more topic-specific in future. If that’s not your title, sorry!

Your denoising-stats visualization suggests that sequences are dropping out because they are failing to merge. DADA2 requires ~20 base pairs of overlap between forward and reverse reads in order to join them. If you’re working with a longer amplicon, it’s possible that many of your reads have insufficient overlap to merge. There’s a good discussion of this in these posts.

Once you’ve determined whether this is your situation, there are quite a few posts here that discuss possible next steps. Let us know if you have trouble moving forward. If you think this is not your situation, please share further sequencing details so we have a little more to work with.

Thanks!
Chris

toobiwankenobi · January 6, 2019, 3:35pm

Hi @ChrisKeefe! Thanks for your reply! It’s not my title. I replied in another topic, and @thermokarst moved my answer into a new topic.

I don’t think that the length of the amplicon is the problem. It is around 420 bp long, so 2x300 should be sufficient, right?

Something to clarify: is the “merged” number the number of merged reads or the number of pairs? If it’s the number of pairs, my stats are not as bad as I initially thought. If “merged” is the number of merged reads, then it’s still bad…

Do you have a suggestion for how I could improve the outcome? I have now tried raising the maximum expected errors with “--p-max-ee 5.0”; maybe this helps (it’s currently running). Sorry, I’m really not an expert. I’m more at a “trial-and-error” stage of sequencing data analysis.

ChrisKeefe · January 7, 2019, 9:32pm

@toobiwankenobi, the table in your original post loosely describes the process to which DADA2 subjects sequences, in chronological order. You appear to be losing most of your sequences during the merge phase, not during filtering.

Allowing more error in your sequences may increase the number of sequences that pass filtering, but that probably won't address the larger merge-related loss shown by your table.

Your sequence runs may be 2x300, but you're only handing DADA2 448 base pairs to work with during the merge. (251-17 + 234-20 = 448) If your amplicon is 420 bp long, and you need 20 bp of overlap to complete the merge, that leaves you only 8 bp of wiggle room.
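The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function names `kept_len` and `merge_overlap` are made up for this example, and it assumes, as the calculation above does, that the 420 bp figure is the length the trimmed and truncated reads have to span.

```shell
#!/usr/bin/env bash
# Bases kept from one read after truncation and left-trimming.
kept_len() {  # usage: kept_len TRUNC_LEN TRIM_LEFT
  echo $(( $1 - $2 ))
}

# Overlap available for merging, given an ASSUMED amplicon length.
merge_overlap() {  # usage: merge_overlap TRUNC_F TRIM_F TRUNC_R TRIM_R AMPLICON
  local total=$(( $(kept_len "$1" "$2") + $(kept_len "$3" "$4") ))
  echo $(( total - $5 ))
}

# Parameters from the original command; 20 bp is the overlap figure cited in this thread.
overlap=$(merge_overlap 251 17 234 20 420)
echo "overlap: ${overlap} bp, margin over a 20 bp requirement: $(( overlap - 20 )) bp"
```

With the parameters from the original command this reports 28 bp of overlap, which is the 8 bp of margin mentioned above.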

If there is significant variability in the length of your ASVs, there is a chance that longer sequences, which have the least overlap, might be systematically removed in the merge process, decreasing sequence count and introducing bias against long ASVs.

Experimenting with more generous truncation parameters may take care of the issue. If it doesn't (e.g. because more-generous parameters introduce too much error into your sequences), you may have to pursue other denoising strategies, or even analyze your sequences as unmerged single-end reads. There are many posts here on the forum describing these options, should you need them. I'm not recommending this at this time - I just want you to know that other paths exist if you can't make this work by adjusting parameters.
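One way to organize that kind of experiment is to generate a command per truncation-length pair and compare the resulting denoising stats. The sketch below only prints the commands rather than running them. The input file name `single_sample_demux.qza` and the candidate lengths are hypothetical placeholders, not recommendations; the flags are the ones already used in this thread.

```shell
#!/usr/bin/env bash
# Print (not run) one denoise-paired command per truncation-length pair.
# INPUT is a hypothetical single-sample artifact; the lengths are illustrative only.
INPUT="single_sample_demux.qza"

sweep_commands() {
  local f r
  for f in 251 255 260; do
    for r in 234 235 240; do
      echo "qiime dada2 denoise-paired" \
        "--i-demultiplexed-seqs ${INPUT}" \
        "--p-trim-left-f 17 --p-trim-left-r 20" \
        "--p-trunc-len-f ${f} --p-trunc-len-r ${r}" \
        "--o-table table_f${f}_r${r}.qza" \
        "--o-representative-sequences rep_seqs_f${f}_r${r}.qza" \
        "--o-denoising-stats stats_f${f}_r${r}.qza"
    done
  done
}

sweep_commands
```

Printing first makes it easy to review the grid before committing compute time; each printed line can then be run as-is, and the "merged" column of each stats file compared.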

"merged" is the number of merged reads. As discussed here, your table's "input" figure is the number of pairs; any decrease in sequence count is loss:

toobiwankenobi · January 8, 2019, 1:09pm

Hi @ChrisKeefe, thanks again for your help!

I ran several rounds of the DADA2 pipeline on just one sample to reduce computation time. The following options gave the highest number of reads after merging:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux_paired_end_bacteria.gza.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 255 \
  --p-trunc-len-r 235 \
  --p-max-ee 4.0 \
  --o-table table_best.qza \
  --p-n-threads 0 \
  --o-representative-sequences rep_seqs_bst.qza \
  --o-denoising-stats denoising_stats_best.qza

With these options, DADA2 has 453 bp to work with for merging, and I end up with around 27k of the original 66k reads. That is still quite a large loss, but I think it’s acceptable.

Thanks for your help!
