Truncating parameters on DADA2 (paired-end demultiplexed reads)

slh277 · January 8, 2018, 3:18pm

Hi, I am working with paired end reads that have been demultiplexed but have not undergone any other processing by the sequencing company (so primers/adapters are still attached). I imported using Phred33 manifest file. The graphs of the unjoined sequences are below:

Based on this comment, I ran the following DADA2 command, using a smaller truncation parameter for the reverse reads based on the graph (Do I need to remove the primers/adapters first? I did not... I see from this post that you can set the truncation parameters to remove the primer sequences, but I'm not sure how to incorporate that into also trimming for quality/denoising):

$ qiime dada2 denoise-paired
--i-demultiplexed-seqs demux-paired-end.qza
--p-trim-left-f 0
--p-trunc-len-f 270
--p-trim-left-r 0
--p-trunc-len-r 120
--o-representative-sequences rep-seqs.qza
--p-n-threads 0
--o-table table-dada2.qza

The total frequency for my 53 samples is from the FeatureTable is: 4,467. Compared to the Moving Pictures Tutorial frequency of 157,298, I am not sure if my truncation parameters are too stringent (in fact for one of my samples, there is a frequency of 0)? Is 270 (forward) and 120 (reverse) reasonable based on the graphs above?

For paired-end analysis, is DADA2 or the recent paired-end tutorial better to use and for what reason?

Many thanks for the help provided thus far! Any guidance on the above from more experienced analysts would be very appreciated. Thank you!

Nicholas_Bokulich · January 8, 2018, 4:58pm

Hi @slh277,

Yes, these should be removed, otherwise they can interfere with dada2's error model (e.g., adapter sequences can make your reads look like chimera).

Set the trim-left-f and trim-left-r parameters to the lengths of the adapters/barcodes/primers (whatever needs to be trimmed from the 5' end of each read).

I think you could probably extend the truncation length. The reverse read quality does start dropping off after 120 bases, but mean phred scores still stay pretty high until around 200 bases... so I'd pick 200 and see how this changes your yield.

Just use dada2. That tutorial is just for deblur/otu picking methods. Joining prior to dada2 will disrupt the error model (because it relies on the raw phred scores for each read separately).

I hope that helps!

slh277 · January 8, 2018, 5:26pm

Thank you for your super quick response @Nicholas_Bokulich! I will try this again based on your advice.

Mehrbod_Estaki · January 8, 2018, 8:48pm

Hi @slh277,
In the original comment I made about the 120 bp quality drop-off point, I just assumed if you were to truncate your reverse reads at that point then there wouldn't be enough overlap to actually merge the paired ends. That's why I originally mentioned that perhaps the forward reads alone be adequate, especially since they seem to be very good in quality. In hindsight I made that comment without taking into consideration your targeted region and primer sets.
I guess the point to emphasize would have been to ensure that following your trimming choices, there be enough overlap to properly join them. See this comment regarding the required overlap for DADA2. That being said, I am however curious how extending your reverse read trunc parameter to 200 (as was suggested above) affects the outcome, so if you could report back on that, that would be very appreciated and would help with my own truncating decisions in the future as well.

slh277 · January 9, 2018, 2:03am

Hi! I really appreciate your time, thoughts and insights on my analysis @Mehrbod_Estaki I'm definitely just getting my feet wet with this type of data and analysis and am grateful for all insight - your comment helped me see where the sequences might be considered low quality, and that makes sense re: truncating and having enough overlap. Please see below for the new results - using a V3/V4 primer set (341F, 806R)

$ qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired-end.qza --p-trim-left-f 50 --p-trunc-len-f 270 --p-trim-left-r 55 --p-trunc-len-r 200 --o-representative-sequences rep-seqs.qza --p-n-threads 0 --o-table table-dada2.qza

Total frequency: 1,104,844

So the frequency seems to be more reasonable! My table.qzv file is attached if you're so inclined to take a look at all of the parameters.

Thanks again for your help!! hope all is going well for your own analysis

table.qzv (425.8 KB)

Mehrbod_Estaki · January 9, 2018, 2:58am

Glad they were helpful in some way @slh277! I'm also happy to hear that the 200 truncating point provided with some good results, this further fuels my suspicions about my own work that I should be a bit more lenient with my cut-off points. Learning by helping!
The better yield here compared to your last run is mainly because of removing the primers/adapters but if you do happen to run this data again with a different/lower truncate parameter I would be very keen again to compare! The point I'm trying to test out for myself is perhaps lowering the truncating point to something like 150-160bp might decrease the resolution of the ASVs slightly but might ultimately produce higher number of features. This would be helpful in instances where you're trying to rescue some sample, for example your sample-51 could end up with more than ~500 ASVs so you wouldn't have to remove it in downstream analysis. Just might make some stats tests easier/more reliable downstream.
Disclaimer! This is purely my own speculation and I am by no means an expert on the topic, so if I'm totally wrong hopefully someone corrects me

I had a quick look at your summary table and it looks pretty good to me, though not knowing the source of the samples and the experiment design its hard to say.

It does look like all but sample-51 have decent coverage (at least for something like gut/stool sample). Depending on your experiment, you might consider removing that one if you can. I also noticed that there a lot of features that occur less than 10 times and a bunch only occur in only 1 sample. These are some potential filtering points you might want to consider when moving on to differential abundance testing (i.e. ANCOM) or other tests that might be sensitive to this type of noise. Again, totally depending on your question, might be that those rare features are what you are actually interested in!

All the best!

slh277 · January 9, 2018, 2:15pm

Hi @Mehrbod_Estaki, this feedback is super helpful - thanks! I will rerun DADA2 using truncation at 120, 150, and 160, with the trimming parameters on this dada to see what the differences are (it would be good to see the comparisons for myself as well! Unfortunately it takes a lot of CPU power/time so I might run these overnight over the next few days and post the results here while exploring the 200 truncation for the time being.

That's also good to know re: sample-51 and getting enough features - I will have to keep this in mind -- as well as your points regarding other filtering points for possible noise. Thanks for mentioning these factors!!

Nicholas_Bokulich · January 9, 2018, 2:39pm

You could take a random subsample (e.g., 5%) of your data to run quick tests on these. I don't think there's a way to do this on demux seqs in QIIME2 (currently), but you could subsample your raw fastqs (before importing) with something like this and then run through QIIME2.

slh277 · January 9, 2018, 3:05pm

@Nicholas_Bokulich Thanks so much for this tip and all of your help on my questions!! Really appreciate it!

slh277 · January 11, 2018, 9:04pm

Hi @Mehrbod_Estaki,

Please see below for comparisons of varying truncation lengths of the reverse read: I took the same demux-paired-end.qza file and ran DADA2 on it three times to generate the 3 new FeatureTables.

Reverse read truncation at...

....at 120:
total frequency: 15,054
median frequency per feature: 4.0
sample-51: sequence count=80
lowest: sample-1 = 12

... at 150:
total frequency: 15,441
median frequency per feature: 4.0
sample-51: sequence count=81
lowest: sample-1 = 20

...and at 160:
total frequency: 15,477.
median frequency per feature: 4.0
sample-51: sequence count=65
lowest: sample-1 = 20

And again at 200:
total frequency: 1,104,844
median frequency per feature: 13.0
sample-51 sequence count: 455 (lowest) and sample 1 = 43,692.

(code used:
$ qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired-end.qza --p-trim-left-f 50 --p-trunc-len-f 270 --p-trim-left-r 55 --p-trunc-len-r [###] --o-representative-sequences rep-seqs.qza --p-n-threads 0 --o-table table-dada2.qza)

This is pretty eye-opening - it seems like even 160 impacts the variation in sequences quite a bit compared to 200. And not sure what's going on with sample 1 jumping from the lowest total frequency to the highest.

If you have any other thoughts or additional parameters to try out, I'm all ears

Mehrbod_Estaki · January 11, 2018, 9:47pm

Thanks for reporting back on this! @slh277
So I believe the big difference between 120,150,160 and the 200 cut offs will be that only with the 200 one were you able to maintain enough overlap (based on your primer sets) to actually join your forward and reverse reads. What is happening is wherever a pair fails to join because insufficient overlap is just being discarded, leaving you with very few reads in those lower truncating tests.
I feel silly not having mentioned this earlier(sorry!), but the proper way to compare this would be to look at those lower parameters (i.e. 120, 160) in one direction only (i.e the reverse reads).

For example comparing:

paired end reads (270 forward, 200 reverse)
reverse only (120)
reverse only (160)

You don't need to spend any more time on this (unless you really want to) I will do some exploring of this myself when I find some time and post the results.
Thanks again!

system · February 18, 2018, 8:17pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.