Where to trim reads

Thank you so much for clarifying everything so thoroughly for me; it is extremely helpful. I really value your contribution.

Maybe my question could have been clearer here: how would I know which was better? On what metric should I judge "better"?

Ah, understood. DADA2 will output a stats file with the number of reads input, filtered, denoised, and joined ("merged"). I think the number of "merged" reads is the best indicator of the "best" parameter settings here, so long as you do not see a sharp drop-off at merging.
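
If it helps, you can view that stats file with something like this (the filenames are placeholders for whatever your denoise step produced):

```
# Tabulate the DADA2 denoising stats (input / filtered / denoised / merged per sample).
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv
```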

I hope that helps!

Ohh I see. Amazing. Thank you so much for your help!

Hi, thank you. As an update, things didn't go very well. I had very little overlap, which I don't quite understand.

I ran three different versions (each took over 24 hours, even with a very large amount of memory and many CPUs, perhaps because I have well over 1000 samples).

I have posted the results on Dropbox, linked here. I truncated at:

290f, 203r

283f, 209r

290f, 221r

I'm honestly not sure any of them had enough overlap, judging from the number of merged reads. For the first two, which had 23 and 22 nt of overlap respectively, many of the samples (like 900!) had 0 merged reads. For the third, which had a 41 nt overlap, it was better, but not great: there were still 800 samples with fewer than 1000 reads, which is still much too low. I'm confused about why this happened. Do you have any insight?

I pretty urgently need to finish this, so any insights or ideas would be very welcome.

Gratefully,
Ariel

P.S. In the meantime I am running it again, cutting at 291f and 226r.

Hi, I am just joining the thread and I am not really a guru,
but I don't think even the new setting of 291f and 226r after trimming will work well, and many reads will still be lost in merging,
because the two reads together add up to 517 nt, so against a ~527 nt amplicon the overlap will be just 10 bases.

Although I cannot see the read quality after demultiplexing, and I guess you are worried about where the quality drops off, why not see if you can extend the truncation so that you have 530 bp or slightly more, say 290f + 240r or some other combination?
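
If you have not already, something like this (placeholder filenames) would regenerate the interactive quality plots from the demultiplexed reads so we can all see where the quality drops off:

```
# Summarize the demultiplexed reads to get per-position quality plots.
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
```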

As @WAS1 noted, you are just barely getting enough overlap with those truncation settings.

That setting should work for you @ariel — I re-reviewed your quality plots and it looks like at both those truncation sites the 1st quartile is around Q=17-18, and median is Q=27-30. So that should be fine...

Please give that a try and let us know how it works!

Because the two reads together add up to 517 nt, so against a ~527 nt amplicon the overlap will be just 10 bases.

It seems that I should have 10 + 17 + 20 = 47 nt of overlap with those settings, because the primers are not included and thus the region is shorter. Is this true or not? Given that these reads do not contain the primers, it seems to me that the region is actually 470 bp. Any thoughts?

I will try the suggested cut sites. Thank you

I re-reviewed your quality plots and it looks like at both those truncation sites the 1st quartile is around Q=17-18, and median is Q=27-30. So that should be fine

When I look at my quality plots, the quality drops to even 24 and 22 at several points in the reverse read, especially after 228r.

Hi, a quick one here, though I have yet to take a look at the plots: your 27F primer stops at position 27 (the primer sequence actually starts at position 8, as in 8F), so it does not start at position 27. It is called 27F because of a modification within it to allow a bit of degeneracy, as far as I know. These are all universal primers, aren't they?
So your region calculation will have to change.
I am sorry, the 534R is most probably (I don't know much about this one) another modification of 519R/518R, which would mean it stops at 534 and starts at 518.

Hi WAS1,

Thank you so much for your help! I did not know that about the primers.

In this case, for the region not including the primers, I get 518-27 = 491

The primers don't seem to be included in the sequences, so it seems to me the 300f and 300r that I'm cutting should be counted against that 491 nt region. What is your take on this?

491 - 291 - 226 = -26 (i.e., 26 bases of overlap)

Even cutting there, I still have samples with 0 and low reads, and about 1000 samples with <5000 reads, so it looks like I need to increase the overlap anyway. I just want to get really clear about the correct way to calculate overlap and whether to include the primers.

I am still running the 291f, 240r attempt, though I'm concerned about some of the quality scores at that cutoff.

Things better?

I don't think so. Yes, the region with this should be 490/491, I would agree, but truncating R at 226 after removing 17 leaves you with a length of 209, so the overlap is not 26. The R primer is not like the F primer.

Looking at your quality plots, one would understand why you are reluctant to extend.

Hi WAS1, thanks for your help, and thank you for asking. Why are you including the primers in the 300? I am not sure what you mean by the R primer not being like the forward, or how you are calculating that only the reverse primer should be subtracted, though it does make sense that the 300 would include the primers.

@Nicholas_Bokulich Do you have feedback on this? It would be great to have explicit instructions in the tutorial on how to calculate overlap, and also great if DADA2 output some statistic on the overlap as well.

Unfortunately, even if I extend to 291f and 240r, I still have 12 samples with zero merged reads, and almost 500 with fewer than 5000. Here are the stats from DADA2. I'm a bit concerned about how to proceed.

Should I at this point try to proceed with the forward reads only?

I’m assuming this would also mean that I need to retrain the classifier with just the forward primer

Sorry for the delay @ariel! I have been out sick :face_with_thermometer: .

That's a great point! That had not occurred to me, though now that you mention it, I recall that this primer is often called 8F...

I also do not know about this one, you should check on that @ariel — though I will note it probably starts at 550 and ends at 534 (since this is a reverse primer).

Even at 291f and 240r your median quality scores are 29 and 27, respectively, so I would not worry too much.

For all but 1 of those 12, it looks like the # of input sequences is very very low, so that is not a merging issue and those samples would be lost anyway. For one of these samples (HMP2_J15247_M_ST_T0_B0_0120_ZWFDEY0-01_GES15-03642_L001), there is a large number of input sequences but all are dropped during denoising! So it is likely that that sample failed somehow — you may want to inspect that one individually to see what the quality profile looks like on it.

So I would say the sample dropouts are not a merging issue, since many reads are passing, but merging clearly is still a problem, since you are losing many reads at that step. This probably comes down to a few factors:

  1. Are you sure primers are trimmed? I expect so, since we now have reads joining successfully, but you should double-check just to make sure.
  2. We may want to reassess this statement:

While it is true that 16S has relatively low length heterogeneity, this may not be entirely true about all domains — my experience is mostly with V4, which has a very tight length distribution. V1-V3 may be different, and a quick search shows that V3 variability alone may have a range of ~50 nt

Also, the developer of dada2 has recommended up to 80 nt overlap as a safe overestimate whenever possible:

and also that non-target DNA hit by these primers can yield potentially useful results:

All of the above advice may or may not fit the actual length distribution of your amplicons: if you have a gel image or Bioanalyzer trace or something else along those lines, use that to calculate the total starting amplicon length distribution.

I am not following this either. How is the R primer different from the F? One way or another the primer length can be subtracted from the total amplicon length if primers have been trimmed, so @ariel's calculations look correct to me — but please elaborate in case I am missing something.
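
To make the arithmetic explicit, here is a rough sketch using the amplicon boundaries and primer lengths discussed above (all of the numbers are assumptions to adjust to your own settings):

```
# Expected overlap = truncated F length + truncated R length - primer-trimmed amplicon length
amplicon=527              # positions 8-534, including primers (as discussed above)
f_primer=20; r_primer=17  # primer lengths, already removed from the reads
trunc_f=291; trunc_r=226  # truncation settings under discussion

region=$(( amplicon - f_primer - r_primer ))   # 490 nt without primers
overlap=$(( trunc_f + trunc_r - region ))      # 27 nt here (26 if you count the region as 491)
echo "region=${region} nt, expected overlap=${overlap} nt"
```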

That is the last resort, and easy to do (just use dada2 denoise-single and the reverse reads will be ignored, no need to re-import)
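
Roughly, that fallback looks like this (a sketch; the filenames and truncation length are placeholders, and your existing paired-end artifact can be used directly):

```
# Denoise using only the forward reads; the reverse reads in the artifact are ignored.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len 290 \
  --o-table table-single.qza \
  --o-representative-sequences rep-seqs-single.qza \
  --o-denoising-stats stats-single.qza
```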

But there are a few other options:

  1. As noted above, look at the actual length distribution.
  2. Brute force: truncate your sequences less and less (i.e., make longer input sequences) to see if you can get more joined (merged) sequences out of dada2 in spite of the quality drop-off (your qualities are good, especially on the forward read; look at the median quality scores to determine where to cut). See the sketch just after this list.
  3. Try merging with q2-vsearch to get a "second opinion" (though note that many sequences may not join due to low quality at the 3' ends of the reads).
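
For option 2, the command would be something along these lines (just a sketch; the filenames and the exact truncation lengths are placeholders to tune against your quality plots and the merged counts in the stats output):

```
# Option 2 sketch: truncate less (keep longer reads), then check the "merged"
# column of the denoising stats.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 250 \
  --p-n-threads 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```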

No — you can use the same taxonomy classifier trained on the entire V1-V3 region; there is no need to retrain (and training with only the forward primer would not work — you would train with both primers and truncate to the same length as your sequences if you want to be really precise, but that much trimming does not matter much in my experience).

cc:ing @Mehrbod_Estaki — he is working on a new tutorial and this would be a great topic to cover! Thanks for the suggestion @ariel!

@Mehrbod_Estaki may also have some more advice on this, he has tons of experience wrangling with dada2 and noisy datasets!

Wow, thank you so much for getting back to me, and for your extremely useful answer. Because it takes 24 hours every time I try to run DADA2, and I have a very finite period of time to finish this, I'm submitting all possible combinations I can think of now and then trying to figure out the best solution in that 24-hour period.

I do not have access to a gel; I can ask the people that ran it, but that may not be information I can get.

It seems that the region is
534 - 8 = 526 (though other people refer to this region as being 527 long; not sure why that is), so if we assume it's 527,

600 - 527 = 73 nt maximum overlap, even if I keep all of the sequences.

The forward drops to 25 after 291 and then eventually drops to 18 (median quality) after 300; the reverse drops to 25 after 227, then drops to ~17-24 after 240, and further drops to 15 after 273.

300 + 273 - 527 = 46 nt of overlap

Though I'm also a bit confused over whether I would also subtract the length of the primers. They aren't included in these sequences, which I've ascertained by searching for them (and parts of them) within the files. If so, that would be 83 nt of overlap.

I wasn't sure whether including bases of such low quality (median as low as 16 and 17) was advisable. What do you think is better: including low-quality bases, or using only the forward reads?

I'm running that version now, along with a forward-reads-only version truncated at 283 (which is where the median quality dips below 30).

I hope that you are feeling better!

Sounds reasonable. And running just the forward reads is a great idea just in case (and if nothing else, it is a useful comparison)

:+1:

Based on all the above discussion, I would think yes, subtract the primer lengths from the total amplicon length you have cited, unless you aren't sure whether that length includes the primers (to be on the safe side, perhaps we should just assume that it does include the primers, and so you would not subtract them).

It is lower than I would like personally, but it really does not sound bad. I would personally opt for fewer, longer sequences over more, shorter sequences, so long as I have enough reads.

So it is definitely worth running it all of these ways and comparing the stats outputs to decide what works best for you.

You could also experiment with the --p-trunc-q parameter instead — since you are joining paired-end reads, it may prove useful here for getting more length out of higher-quality reads, instead of truncating all reads at the same length.

Sorry this is such a difficult trial-and-error process! The good news is that all the steps after denoising are usually easier.

I hope that helps!

Okay, thank you. I also tried one version truncating at 293, where there are two bases with quality 13 and one at 14.

I guess I'm wondering: if I can get enough reads out of this, is using lower quality + paired-end still better than just using the forward reads?

In terms of --p-trunc-q:

"Reads are truncated at the first instance of a quality score less than or equal to this value. If the resulting read is then shorter than trunc_len_f or trunc_len_r (depending on the direction of the read) it is discarded."

I'm wondering what you think would be a useful way to use this parameter in this context.

That's very subjective. I'd say no; if you don't get enough reads, then shorter is better (and I have gone that route when in your position). But if it is a matter of losing some samples that maybe are not critical... then it becomes a balance of priorities.

You could just omit the trunc_len parameters, and I think this should still work (relying on the read-merging stage to tell you whether you have trimmed too much). But you could also set trunc_len to a value that you KNOW will not work on its own (with wiggle room for length variation, and for the possibility that one read is high quality and truncated less while its pair is truncated more, within reason)... so maybe 200 nt in this case?
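
To make the first option concrete, a rough sketch (placeholder filenames, and Q20 is just an example cutoff); a truncation length of 0 means no fixed-length truncation, so each read is cut only where --p-trunc-q finds the first low-quality base:

```
# Per-read quality truncation instead of a fixed length; the merging step
# will then tell you whether enough overlap survived.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-trunc-q 20 \
  --o-table table-truncq.qza \
  --o-representative-sequences rep-seqs-truncq.qza \
  --o-denoising-stats stats-truncq.qza
```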

I hope that helps! Please share the stats results when all runs are done and we can help make a decision.

Hello! Excitingly, it worked a lot better with 300f and 273r. The person that did the sequencing didn't think there was any one length of the region that could be calculated, but noted that because it's a variable region it can vary quite a bit, even up to 550 nt. So even in the case of a 550 nt amplicon, it looks like this cutoff would still give 26 nt of overlap.

What do you think about the fact that several lower-quality bases had to be used with this cutoff? It seems to me that this is just what had to be done with this particular dataset, and it seems fairly high quality overall. I'm moving forward with this one.

Even with this cutoff, if I want to use only samples with, say, >10,000 reads, I would have to discard a couple hundred, and about 100 if I want to use samples with >5000 reads, though this may be okay.
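
For reference, the sample filtering I have in mind would be something like this (filenames and the threshold are placeholders):

```
# Keep only samples with at least 10,000 reads in the feature table.
qiime feature-table filter-samples \
  --i-table table.qza \
  --p-min-frequency 10000 \
  --o-filtered-table table-min10k.qza
```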

Interestingly, I also tried cutting at 300f and 292r, and despite the additional overlap it actually did not merge as well as 300f and 273r; in fact it was much worse, presumably because the sequence quality really drops off after 273.

The forward-only attempt also worked well, and there I would need to discard fewer samples; however, I'd like to use paired-end reads if possible, since my reading indicates that they are better able to align to reference sequences and to detect repetitive elements and rearrangements.

Thank you again for all of your help!!

P.S. I also tried several different quality cutoffs (20, 24, and 26) with 200 f and r specified, but none of those worked: they all had mostly 0 merged non-chimeric reads, and a couple of samples with only 2. Anyway, using the forward and reverse cuts that I did worked great; I just wanted to share that as well.

:tada: That's fantastic! This does look a lot better. You are still losing some sequences during merging, but this is a small fraction so probably doesn't matter.

If you can, it may be interesting to run all the same downstream analyses (or at least look at taxonomic composition and beta diversity) on both the single-end and these paired data to make sure this does not matter. But don't sweat it.

It is fairly high quality. And put it this way: if including the lower-quality bases were a problem, dada2 would let you know (by dropping reads at the filtering step)! You have lots of reads in the output, so you can move on.

Longer sequences contain more information, so it is always better when possible.

Thanks for sharing! Very glad you were able to get this working in the end (it is almost always a maddening trial-and-error process!)
