Where to trim reads

Sorry for the delay @ariel! I have been out sick :face_with_thermometer: .

That's a great point! That had not occurred to me, though now that you mention it, I recall that this primer is often called 8F...

I also do not know about this one; you should check on that, @ariel. I will note, though, that it probably starts at 550 and ends at 534 (since this is a reverse primer).
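The coordinate logic can be sketched as follows. A reverse primer's 5' end maps to the highest plus-strand coordinate and extends toward lower coordinates; the 17 nt primer length here is only an assumption for illustration, so substitute the real primer length:

```python
def reverse_primer_span(five_prime_pos, primer_len):
    """Plus-strand span covered by a reverse primer whose 5' end maps to
    five_prime_pos. It extends toward lower coordinates for primer_len bases."""
    return five_prime_pos - primer_len + 1, five_prime_pos

# Assumed 17 nt primer with its 5' end at position 550:
start, end = reverse_primer_span(550, 17)
print(start, end)  # 534 550
```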

Even at 291f and 240r your median quality scores are 29 and 27, respectively, so I would not worry too much.
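To put those scores in perspective, Phred quality converts to per-base error probability as p = 10^(-Q/10), so Q29 and Q27 are both well under 1% expected error per base. A quick sketch:

```python
def phred_to_error_prob(q):
    # Phred definition: Q = -10 * log10(p)  =>  p = 10 ** (-Q / 10)
    return 10 ** (-q / 10)

# Q29 -> roughly 0.13% error per base; Q27 -> roughly 0.2% per base
print(phred_to_error_prob(29))
print(phred_to_error_prob(27))
```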

For all but one of those 12, it looks like the number of input sequences is very low, so that is not a merging issue and those samples would be lost anyway. For one of these samples (HMP2_J15247_M_ST_T0_B0_0120_ZWFDEY0-01_GES15-03642_L001), there is a large number of input sequences, but all are dropped during denoising! So it is likely that that sample failed somehow; you may want to inspect that one individually to see what its quality profile looks like.

So I would say this is not purely a merging issue, since many reads are passing; but merging is clearly still a problem, since you are losing many reads at that step. This probably comes down to a few factors:

  1. Are you sure primers are trimmed? I expect so, since we now have reads joining successfully, but you should double-check just to make sure.
  2. We may want to reassess this statement:

While it is true that 16S has relatively low length heterogeneity overall, this may not hold for every domain. My experience is mostly with V4, which has a very tight length distribution; V1-V3 may be different, and a quick search suggests that V3 variability alone may span a range of ~50 nt.

Also, the developer of dada2 has recommended up to 80 nt overlap as a safe overestimate whenever possible:

and also that non-target DNA hit by these primers can yield potentially useful results:

All of this above advice may or may not fit the actual length distribution of your amplicons: if you have a gel image or bioanalyzer trace or something else along those lines, use that to calculate the total starting amplicon length distribution.
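For the back-of-the-envelope version of that calculation: after merging, the overlap is simply the sum of your two truncation lengths minus the primer-trimmed amplicon length. Here is a minimal sketch using the 291/240 truncation positions from above; the 460 nt amplicon length is only a placeholder, so plug in your gel/Bioanalyzer estimate:

```python
def overlap(trunc_f, trunc_r, amplicon_len):
    """Bases covered by both reads after merging: the sum of the two
    truncation lengths minus the primer-trimmed amplicon length."""
    return trunc_f + trunc_r - amplicon_len

# 291 (forward) / 240 (reverse) truncation; 460 nt is an assumed
# primer-trimmed amplicon length, for illustration only.
print(overlap(291, 240, 460))  # 71 nt of overlap under those assumptions
```

If the result comes out below dada2's minimum (~12 nt), those amplicons simply cannot merge at those truncation settings.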

I am not following this either. How is the R primer different from the F? One way or another the primer length can be subtracted from the total amplicon length if primers have been trimmed, so @ariel's calculations look correct to me — but please elaborate in case I am missing something.

That is the last resort, and it is easy to do (just use dada2 denoise-single; the reverse reads will be ignored, and there is no need to re-import).

But there are a few other options:

  1. as noted above, look at the actual length distribution
  2. Brute force: truncate your sequences less and less (i.e., keep longer input sequences) to see if you can get more joined (merged) sequences out of dada2 in spite of the quality drop-off. Your qualities are good, especially on the forward read; look at the median quality scores to decide where to cut.
  3. Try merging with q2-vsearch to get a "second opinion" (though note that many sequences may not join due to low quality at the 3' ends).
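The brute-force search in option 2 can be sketched as enumerating (trunc-len-f, trunc-len-r) pairs that still guarantee a minimum overlap. All the numbers here (300 nt reads, 80 nt minimum overlap, 460 nt primer-trimmed amplicon) are placeholder assumptions for illustration:

```python
def candidate_truncations(amplicon_len, read_len=300, min_overlap=80, step=10):
    """Return (trunc_f, trunc_r) pairs, longest first, that keep at least
    min_overlap bases of overlap after merging. trunc_r <= trunc_f, to favor
    the higher-quality forward read. Defaults are assumptions, not advice."""
    pairs = []
    for trunc_f in range(read_len, 0, -step):
        for trunc_r in range(trunc_f, 0, -step):
            if trunc_f + trunc_r - amplicon_len >= min_overlap:
                pairs.append((trunc_f, trunc_r))
    return pairs

# e.g. for an assumed 460 nt primer-trimmed amplicon:
for f, r in candidate_truncations(460)[:3]:
    print(f, r)
```

Each surviving pair is a candidate to feed to `--p-trunc-len-f` / `--p-trunc-len-r`, starting from the most permissive.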

No, you can use the same taxonomy classifier trained on the entire V1-V3 region; there is no need to retrain. (Training with only the forward primer would not work. To be really precise, you would train with both primers and truncate to the same length as your sequences, but in my experience that much trimming does not matter much.)

cc:ing @Mehrbod_Estaki — he is working on a new tutorial and this would be a great topic to cover! Thanks for the suggestion @ariel!

@Mehrbod_Estaki may also have some more advice on this, he has tons of experience wrangling with dada2 and noisy datasets!