DADA2 trimming values for paired-end sequences

Hello again,

Before getting to the merging step I asked about earlier today, it looks like I have run into trouble with dada2... Out of the 4 runs of paired-end sequences I have, the denoising step for one of them has finished (after running since Tuesday), but something clearly went wrong because I am left with 1400 sequences out of ~13 million :disappointed_relieved:. I believe the issue has to do with the trimming values and the fact that the sequences don't overlap, so again, I just wanted to double-check before I run another command and wait 4-5 days. I want to be on the safe side, as my project is due quite soon and I don't have much time to spare. :confounded:

These are the quality plots:

This is the command I gave:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-plate2-run2.qza \
  --p-trim-left-f 7 \
  --p-trunc-len-f 221 \
  --p-trim-left-r 7 \
  --p-trunc-len-r 221 \
  --o-representative-sequences rep-seqs-plate2-run2.qza \
  --o-table table-plate2-run2.qza

And these are the results:

I realized that because I was so worried about trimming too much for the sequences to overlap, I did not trim the primers and adapters, as was recommended in a different post. I know the primer is 17 bases long, but I am not sure about the adapter. The first 50 bases or so look quite similar between sequences, so I was thinking of increasing the --p-trim-left values to 50. Does that sound like a reasonable value? Would it also be worth increasing --p-trunc-len? Based on the quality plots I thought it would be suitable, but now I am thinking it is too low.

Could there be other reasons why the denoising went wrong that I am overlooking?

Thanks so much for the help and support!! :star_struck:

Hi @Alex_14262,

Sorry for the delayed response!

Another option would be to use q2-cutadapt to identify where your primers end (catching the adapter as a side effect), so that you keep as many bases as possible. Check out the --help text for trim-paired for more info.
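A minimal sketch of that q2-cutadapt step, assuming the plugin is available in your QIIME 2 release (check `qiime cutadapt trim-paired --help`). The primer values and the output file name are placeholders, not details from this thread. The leading `echo` makes this a dry run that only prints the command; remove it to execute.

```shell
# Placeholders: replace with the actual 5'->3' primer sequences for your amplicon.
FWD_PRIMER="YOUR_FORWARD_PRIMER"
REV_PRIMER="YOUR_REVERSE_PRIMER"

# Dry run: prints the command instead of running it. Remove "echo" to execute.
echo qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-plate2-run2.qza \
  --p-front-f "$FWD_PRIMER" \
  --p-front-r "$REV_PRIMER" \
  --o-trimmed-sequences trimmed-plate2-run2.qza
```

With the `--p-front-*` options, cutadapt removes the primer together with everything before it, which is why an upstream adapter goes away as well. If the primers are removed this way, the trim-left values in the DADA2 step can stay at 0.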

Your quality plots are honestly superb. I wish everyone had data that looked that clean. You should be fine pushing the trunc-len a little further on your forward reads, maybe to around 240. I think your reverse reads are fine at around 220; going further might be more trouble than it's worth.

There are a couple of places where this can happen, although we usually see that kind of loss because of a failed merge, which suggests your reads may simply not overlap even in ideal circumstances. How long is your target amplicon?
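The overlap question comes down to simple arithmetic. A rough sketch, assuming each read starts at its primer, so that the two truncated reads together have to span the whole amplicon (primers included). The 460-base amplicon length is purely hypothetical; 12 is the default `minOverlap` of `mergePairs` in the DADA2 R package, and the value used by your q2-dada2 release may differ.

```shell
# Expected overlap between the truncated forward and reverse reads.
# usage: expected_overlap TRUNC_LEN_F TRUNC_LEN_R AMPLICON_LEN
expected_overlap() {
  echo $(( $1 + $2 - $3 ))
}

AMPLICON_LEN=460   # hypothetical; use your own amplicon length, primers included
MIN_OVERLAP=12     # assumption: DADA2 R package default for mergePairs

overlap=$(expected_overlap 240 220 "$AMPLICON_LEN")
if [ "$overlap" -ge "$MIN_OVERLAP" ]; then
  echo "Expected overlap: $overlap bases, enough to attempt a merge"
else
  echo "Expected overlap: $overlap bases, below the minimum of $MIN_OVERLAP"
fi
```

Note that trim-left does not change this number: it removes bases from the outer ends of the reads, not from the region where they overlap.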

Another place to look is the primers: leaving them in can cause reads to be misidentified as chimeric (which you're already working on correcting).

Assuming you are running QIIME 2 2017.12 (or 2018.2), you should be able to add --verbose to the command, and it will print a table at the end indicating how many reads make it through each step. (We plan on turning that into a .qzv soon-ish; printing it out was at least better than nothing.)

A final option, since you indicate you are low on time, is to just analyze the forward reads (once again you have really nice quality scores overall so you can push those pretty far). While that does limit your resolution a bit, it's still an option, and your biological signal may still be there.
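A sketch of the forward-reads-only route, assuming your q2-dada2 release lets `denoise-single` take paired-end input and use only the forward reads (check `qiime dada2 denoise-single --help` to confirm). The values and output names are illustrative: 17 is the primer length mentioned in the original post, and 240 is the forward truncation suggested above. The leading `echo` makes this a dry run; remove it to execute.

```shell
TRIM_LEFT=17    # primer length stated by the poster; any adapter is not accounted for
TRUNC_LEN=240   # forward truncation length suggested in this reply

# Dry run: prints the command instead of running it. Remove "echo" to execute.
echo qiime dada2 denoise-single \
  --i-demultiplexed-seqs paired-end-plate2-run2.qza \
  --p-trim-left "$TRIM_LEFT" \
  --p-trunc-len "$TRUNC_LEN" \
  --o-representative-sequences rep-seqs-plate2-run2-fwd.qza \
  --o-table table-plate2-run2-fwd.qza
```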

Thanks for the help! I adjusted the parameters, and this will hopefully solve the problem. :grin:

Hi @ebolyen

Following up on that, one of my dada2 commands has now been running for 10 days (the one where I also didn't remove the primers properly...). One of its samples has 7 million reads, while the rest have 250,000 or fewer. I believe that is why it is taking so long to run. Should I just let it run until it finishes (can dada2 cope with it?), or is there something I should have done before dada2 to bring the number of sequences down to a more realistic value?

Thanks!

Hey @Alex_14262,

DADA2 should be fine, I would let it keep running. Hopefully it will finish soon (or already has, since a weekend has passed since your post).

Typically, nearly all of your reads are going to be the same length, so any filtering done by, say, trunc-len won't change the total number of reads in a meaningful way. There isn't much pre-processing to be done beyond that.

An option you should definitely use when you have a lot of data is --p-n-threads which will parallelize the steps after error-model training (which itself is limited to 1 million reads by default anyway).
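Putting the suggestions from this thread together, a re-run might look like the sketch below. The thread count of 4 is an arbitrary example (set it to what your machine has). The trim-left value assumes both primers are 17 bases long, as stated in the original post, and does not account for any adapter; 240 and 220 are the truncation lengths suggested earlier. The leading `echo` makes this a dry run; remove it to execute.

```shell
N_THREADS=4    # arbitrary example; match it to the cores available to you
TRIM_LEFT=17   # assumes both primers are 17 bases; adapter not accounted for

# Dry run: prints the command instead of running it. Remove "echo" to execute.
echo qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-plate2-run2.qza \
  --p-trim-left-f "$TRIM_LEFT" \
  --p-trunc-len-f 240 \
  --p-trim-left-r "$TRIM_LEFT" \
  --p-trunc-len-r 220 \
  --p-n-threads "$N_THREADS" \
  --o-representative-sequences rep-seqs-plate2-run2.qza \
  --o-table table-plate2-run2.qza \
  --verbose
```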

Good luck!

Thanks for the help! :star_struck:

I hope you don't mind me suggesting that it may be useful to mention removing primers before denoising in the Moving Pictures tutorial, especially since the trimming value in that example is 0, which can give inexperienced people like me the wrong idea. From what I remember, I didn't read that there, so it would save some trouble for other beginners who start using qiime, and would save you some time too! Thanks! :smiley:
