Where to trim reads

ariel · August 23, 2018, 8:43pm

Hello, thank you again for providing this resource and answering so many questions!

I read a number of other questions on this topic, however I'm hoping that someone can sanity check my primer calculations and trimming.

the V1 through V3 hyper-variable regions (V1-V3) of 16S were amplified using primers 27F and 534R (27F:5'-AGAGTTTGATCCTGGCTCAG-3' and 534R: 5'- ATTACCGCGGCTGCTGG-3')

I'm calculating the length of the sequenced region as
534-27= 507,
and from there calculating the overlap as
507- 17 - 20 (primers) - 283 (forward length) - 203 (reverse length) = -16

Am I correct in my calculation that this would be only a 16 nt overlap, which is not enough? Does this mean I need to retain an additional 4 bases regardless of the quality? I am trying to keep the median above 30.

I have the choice of

extending the forward to 290 and keeping two quality 29 and one quality 28, for a total 23 nt overlap,
extending the reverse to 209, and retain one quality 27 and one quality 29 for a total 22 nt overlap.

I'm currently running both. Is one of these better than the other? What is the best way to make these tradeoffs?

I made a visualization of my sequences and uploaded to dropbox.

Also, Calculate overlap base pairs

"the required overlap you want to keep is 20nt+ natural variation in your target. For example if your expected target of 440bp can also be 430 bps, then you would need 20nt+10nt minimum overlap for proper merging"

How can I know the natural variation in my target?

Nicholas_Bokulich · August 24, 2018, 2:25pm

You do not need to subtract primers in your overlap calculation, since the truncation is occurring at the 3' end of each read... any amount you trim from the 5' end (e.g., primers) does not impact the overlap.

So you should have 507 - 283 - 203 = 21 nt overlap! Perfect.

You could simulate the amplicons from a reference database and look at the length distribution... but there's not much point, since you will not know the actual variation in your samples (since you don't know the actual species present) and there is so little variation in 16S length that it does not really matter all that much compared to something like ITS.

I hope that helps!

ariel · August 24, 2018, 4:44pm

Hi Nicholas,

Thank you so much for your help.

The reason I subtracted the primers is because I was thinking that if they are cut off (which they appear to be because I cannot find them in the sequences upon searching) the region is then smaller? AKA they aren't included in the sequence lengths I'm seeing?

I'm a little bit confused about the way you calculated the overlap, because if the entire region is 507 and I have 283 forward and 203 reverse, wouldn't they not quite reach each other? The way I'm imagining it is like below, what am I missing?

283 ----------------------
507-------------------------------------------
... .... .... .... .... .... ... ... .--------------- 203

Does this mean that a 20 nt overlap is sufficient for the purpose of generic 16S analysis?

Nicholas_Bokulich · August 24, 2018, 4:50pm

Ha! I can't do basic arithmetic that early in the morning.

You were correct initially — subtract primers from the total sequence length.

I guess that means I should answer your original question:

Run both and see which works best. It looks like you could get a bit more mileage out of those forward reads, so see if you can do so without losing too many reads due to error...

Yes

Sorry I got it wrong the first time! Let me know how your test runs go!

ariel · August 24, 2018, 4:57pm

Thank you so much! And yes basic arithmetic can be a challenge for us all! especially in the morning!

What exactly do you mean by this? And what do you mean get a bit more mileage out of the forward reads? and losing too many reads due to "error"? Also, does it matter if I lose some/many reads if I have a lot to start with?

From Dada2 output for one of the versions I'm running (it's taken 20 hours so far so this is what I have so far):
Initializing error rates to maximum possible estimate.

Sample 1 - 62041 reads in 22203 unique sequences.
Sample 2 - 74662 reads in 22874 unique sequences.
Sample 3 - 60757 reads in 18857 unique sequences.
Sample 4 - 74723 reads in 23412 unique sequences.
Sample 5 - 68176 reads in 23701 unique sequences.
Sample 6 - 91649 reads in 32816 unique sequences.
Sample 7 - 66666 reads in 20898 unique sequences.
Sample 8 - 87004 reads in 24568 unique sequences.
Sample 9 - 52548 reads in 14171 unique sequences.
Sample 10 - 83251 reads in 25052 unique sequences.
Sample 11 - 72178 reads in 21779 unique sequences.
Sample 12 - 82481 reads in 26951 unique sequences.
....
2b) Reverse Reads

Initializing error rates to maximum possible estimate.
Sample 1 - 62041 reads in 23491 unique sequences.
Sample 2 - 74662 reads in 20043 unique sequences
Sample 3 - 60757 reads in 17475 unique sequences.
Sample 4 - 74723 reads in 18901 unique sequences.
Sample 5 - 68176 reads in 19530 unique sequences.
...

Nicholas_Bokulich · August 24, 2018, 5:13pm

You said you were running two different analyses and asking which would work better... I'm just proposing that we can figure this out empirically. You are already running both so let's just let the results answer for us. My money is on #1, though, because:

Sorry, bad metaphor. It looks like the forward reads have higher quality for longer and you can extend the truncation length a bit. So truncating the forward reads at 290 sounds swell. Extending the trunc length on the reverse seems less possible.

If you extend the trunc length too far, you are potentially including more erroneous bases (since quality drops at the 3' end). Dada2 does an initial pass to filter out reads with too many expected errors (see the --p-max-ee parameter). So extending trunc length --> more expected errors --> more reads dropped prior to denoising.

Looks like you have lots of input reads per sample — I would not be too worried. So increasing trunc length and accepting that some reads will be filtered before denoising is fine... it is better than not quite having enough sequence to join paired-end reads, in which case the sequences you are losing could actually bias the results (because, e.g., amplicons that are 507 nt long will join, but maybe 509 will not! this will impact some taxa more than others, which may have biological importance, seriously biasing your results. Losing reads to the pre-filter will not bias your results, since presumably the errors are random and not related to the content of the sequences)

ariel · August 24, 2018, 5:26pm

Thank you so much for clarifying everything so thoroughly for me it is extremely helpful. I really value your contribution.

Maybe my question could have been more clear here: how would I know which was better? On what metric should I judge "better"

Nicholas_Bokulich · August 24, 2018, 5:34pm

Ah understood. Dada2 will output a stats file with the number of reads input, filtered, denoised, and joined ("merged"). I think the number of "merged" reads is the best indicator of the "best" parameter settings here, so long as you do not see a sharp drop-off at merging.

I hope that helps!

ariel · August 24, 2018, 8:18pm

Ohh I see. Amazing. Thank you so much for your help!

ariel · August 26, 2018, 7:29am

Hi, thank you. As an update, things didn't go very well. I had very little overlap, which I don't quite understand.

I ran three different versions (each took over 24 hours even with an extremely high amount of memory and cpus perhaps because I have well over 1000 sequences)

I have posted the results on dropbox, linked here. I did cutting at

290f, 203r

283f, 209r

290f, 221r

I'm honestly not sure any of them had enough overlap judging from the # of merged reads. For the first two, which had 23 and 22 nt overlap respectively, many of the samples (like 900!) had 0 merged reads. for the second, which had a 41 nt overlap, it was better, but not necessarily great. There were still 800 with less than 1000 reads! And still this is much too low. I'm confused why this happened. Do you have any insight?

I pretty urgently need to finish this, do you have any insights or ideas?

Gratefully,
Ariel

p,s. in the meantime am running it again, cutting at 291f and 226r

WAS1 · August 27, 2018, 10:28am

Hi, I am just joining the thread and I am not really a guru,
but I don't think even the new setting of 291f and 226r after trimming will work well and many will still be lost in merging.
because coming from both sides will add up to 517 so overlap will be just 10 bases.

Although I cannot see read quality after demultiplexing and I guess you are worried of where the quality drops off, why not see if u can extend such that u have 530bp or slightly more say 290f + 240r or some other combination

Nicholas_Bokulich · August 27, 2018, 2:15pm

As @WAS1 noted, you are just barely getting enough overlap with those truncation settings.

That setting should work for you @ariel — I re-reviewed your quality plots and it looks like at both those truncation sites the 1st quartile is around Q=17-18, and median is Q=27-30. So that should be fine...

Please give that a try and let us know how it works!

ariel · August 27, 2018, 6:16pm

Because coming from both sides will add up to 517 so overlap will be just 10 bases.

It seems that I should have 10 + 17 + 20 = 47 overlap with those settings, because the primers are not included and thus the region is shorter, is this true or not? Given that these reads do not contain the primers, it seems to me that the region is actually 470 bp. Any thoughts?

I will try the suggested cut sites. Thank you

I re-reviewed your quality plots and it looks like at both those truncation sites the 1st quartile is around Q=17-18, and median is Q=27-30. So that should be fine

When I look at my quality plots, the quality drops to even 24 and 22 in several points in the reverse, especially after 228r

WAS1 · August 27, 2018, 9:43pm

Hi, Quick one here though I am yet to take a look at the plots, your 27F primer stops at position 27!! (the primer sequence actually starts at 8F) so does not start at position 27. it is called 27F because of a modification within it to allow a bit of degeneracy. As far as I know. These are all universal primers. isn't it?
So your region calcluation will have tochange
I am sorry the 534R is most probably (because I don't know much about this one) another modification of 519R/518R which will mean it stops at 534 and start at 518

ariel · August 27, 2018, 10:37pm

Hi WAS1,

Thank you so much for your help! I did not know that about the primers.

In this case, for the region not including the primers, I get 518-27 = 491

the primers don't seem to be included in the sequences, so it seems to me the 300f and 300r that I'm cutting should be from that 491 region. What is your take on this?

491 - 291f -226r = -26 (26 bases of overlap)

Even cutting there, I still have samples with 0 and low reads, and about 1000 samples with <5000 reads so it looks like I need to increase the overlap anyway, just want to get really clear about the correct to calculate overlap and whether to include the primers.

Am still running the 491f, 240r attempt, though I'm concerned about some of the quality scores in that cutoff

WAS1 · August 28, 2018, 5:11pm

Things better?

I don't think so. yes region with this should be 490/491, I would agree, but by truncating R at 226 after removing 17 leaves u with 209 length so overlap is not 26. the R primer is not like the F primer

Looking at ur quality plots one would understand why u are reluctant to extend.

ariel · August 28, 2018, 8:43pm

Hi WAS1, thanks for your help and thank you for asking. Why are you including the primers in the 300. Not sure what you mean by the R primer is not like the forward/not sure how you are calculating that only the reverse should be subtracted., though it does make sense that the 300 would include the primers.

@Nicholas_Bokulich Do you have feedback on this? It would be great to have explicit instructions in the tutorial on how to calculate overlap, and also great if DADA2 spit out some statistic on the overlap as well.

Unfortunately, even if i extend to 291f, and 240 reverse, I still have 12 samples with zero merged reads, and almost 500 with fewer than 5000. Here are the stats from dada2. I'm a bit concerned about how to proceed.

ariel · August 28, 2018, 11:56pm

Should i at this point try to proceed with the forward reads only?

I'm assuming this would also mean that I need to retrain the classifier with just the forward primer

Nicholas_Bokulich · August 29, 2018, 12:48am

Sorry for the delay @ariel! I have been out sick .

That's a great point! That had not occurred to me, though now that you mention I recall that this primer is often called 8F...

I also do not know about this one, you should check on that @ariel — though I will note it probably starts at 550 and ends at 534 (since this is a reverse primer).

Even at 291f and 240r your median quality scores are 29 and 27, respectively, so I would not worry too much.

For all but 1 of those 12, it looks like the # of input sequences is very very low, so that is not a merging issue and those samples would be lost anyway. For one of these samples (HMP2_J15247_M_ST_T0_B0_0120_ZWFDEY0-01_GES15-03642_L001), there is a large number of input sequences but all are dropped during denoising! So it is likely that that sample failed somehow — you may want to inspect that one individually to see what the quality profile looks like on it.

So I would say this is not a merging issue since many reads are passing, but clearly it still is since you are losing many reads at the merging step. This probably comes down to a few factors:

Are you sure primers are trimmed? I expect so, since we now have reads joining successfully, but you should double-check just to make sure.
We may want to reassess this statement:

While it is true that 16S has relatively low length heterogeneity, this may not be entirely true about all domains — my experience is mostly with V4, which has a very tight length distribution. V1-V3 may be different, and a quick search shows that V3 variability alone may have a range of ~50 nt

Also, the developer of dada2 has recommended up to 80 nt overlap as a safe overestimate whenever possible:

and also that non-target DNA hit by these primers can yield potentially useful results:

All of this above advice may or may not fit the actual length distribution of your amplicons: if you have a gel image or bioanalyzer trace or something else along those lines, use that to calculate the total starting amplicon length distribution.

I am not following this either. How is the R primer different from the F? One way or another the primer length can be subtracted from the total amplicon length if primers have been trimmed, so @ariel's calculations look correct to me — but please elaborate in case I am missing something.

That is the last resort, and easy to do (just use dada2 denoise-single and the reverse reads will be ignored, no need to re-import)

But there are two other options:

as noted above, look at the actual length distribution
Brute force: truncate your sequences less and less (i.e., make longer input sequences) to see if you can get more joined (merged) sequences out of dada2 in spite of the quality drop off (your qualities are good, especially on the forward read. Look at the median accuracy scores to determine where to cut)
Try merging with q2-vsearch to get a "second opinion" (though note that many sequences may not join due to low quality at the 5' ends).

No — you can use the same taxonomy classifier trained to the entire V1-V3 region, no need to train (and training with only the forward primer would not work — you would train with both primers and truncate to the same length as your sequences if you want to be really precise, but that much trimming does not matter much in my experience)

cc:ing @Mehrbod_Estaki — he is working on a new tutorial and this would be a great topic to cover! Thanks for the suggestion @ariel!

@Mehrbod_Estaki may also have some more advice on this, he has tons of experience wrangling with dada2 and noisy datasets!

ariel · August 29, 2018, 1:21am

Wow thank you so much for getting back to me and your extremely useful answer. Because it takes 24 hours every time I try to run DADA2, and i have a very finite period of time to finish this, i'm submitting all possible combinations i can think of now and then trying to figure out the best solution in that 24 hour period.

I do not have access to a gel, and I can ask the people that ran it but that may not be information that I can get.

It seems that the region is
534-8 = 526 (though other people refer to this region as being 527 long, not sure why that is.) so if we assume it's 527,

527 - 600 = 73nt maximum overlap even if I keep all of the sequences.

The forward drops to 25 after 291, and then eventually drops to 18 (median quality) after 300, the reverse drops to 25 after 227, and then drops to ~17-24 after 240, and further drops to 15 after 273.

527-300-273 = 46 nt overlap

though also I'm a bit confused over whether I would also subtract the length of the primers? they aren't included in these sequences, which i've ascertained by searching for them (and parts of them) within the files. if so, that would be 83 nt of overlap.

I wasn't sure if including bases of such low quality (as low as 16 and 17 median) was questionable. What do you think is better between including low quality bases and using only the forward reads?

I'm running that version now, along with the only forward reads, trimmed at 283 (which is where the median quality dips below 30)

I hope that you are feeling better!