DADA2 features and frequencies

asr17 · February 14, 2018, 7:47pm

Hi experts,
I am trying to figure out the best settings in DADA2 on my MiSeq PE reads targeting ITS2. In a previous post I found it very helpful, the focus was mainly on the number of features in setting trunc-len DADA2, truncation lengths and features number - #9 by benjjneb

But what about total frequencies? on a subset of my data here is what I got:
using trunc-len-f 250 and trunc-len-r 199: total Features=788 and total Frequencies=170,363
using trunc-len-f 250 and truc-len-r 204 : total features= 804 and total frequencies= 168,588
Which truncation to choose from? One might say go with the higher features but losing sequence counts might affect my low abundance samples.
I appreciate your inputs,
Thanks!

colinbrislawn · February 14, 2018, 8:13pm

Good morning,

Great question. Can you also post the quality score plots for the forward and reverse reads? That's will help me understand what's in those last 5 bp you are trimming off.

My first through it that 5 bp would only make a small difference, and that seems to be true: a change of 16 ASVs and around 1% of your reads is pretty minor. I think you could safely choose either of these results!

Colin

asr17 · February 14, 2018, 8:38pm

Hi @colinbrislawn,

I agree the difference is too small, one can go either way but which one matters most!
Here is my quality scores

. I hope the pic is clear enough.

Thanks for your help!

colinbrislawn · February 14, 2018, 9:23pm

Those pic looks great! How long is your amplicon? The V4 16S region is around 240 bp but this looks like a 200 bp region.

In general, I'm in favor of user strictor cutoffs even if that reduces my number of reads or ASVs. Quality over quantity!

I'm not really sure what else to say. Let's see what other people think.

Colin

asr17 · February 14, 2018, 11:07pm

This is ITS2 region with an amplicon of 340-360bp, after trimming forward and reverse primers I should get roughly ~320 bp in DADA2. In both truncation scenarios, my reads should have enough overlap to be joined.
Yes let's hear what others have to say.
Thanks for your input!

Mehrbod_Estaki · February 15, 2018, 6:33am

My 2 cents. As @colinbrislawn mentioned I think you can choose either of those cases because they produced negligible (in my opinion) differences. I'm guessing the bulk, if not all, of the increase in your features # probably comes from singletons or features with very low frequencies in a handful of samples only. You can check this for yourself. These are typically removed when you are doing differential abundances analyses anyways as they don't really provide much information and increase noise, so it won't affect anything there, and in indexes where they drive alpha-diversity changes see this great discussion about why perhaps those are incorrect to begin with.
Ultimately you can look at the sequencing depth of your samples between the two cases and see if one allows you to retain more samples. In that case, I would stick with that option.

Edit: After re-reading the discussion I linked myself, I realize perhaps most of those feature increase are not singletons per se, as DADA2 tries to avoid those, but I think they would still fall within the realm of very low frequencies within very few samples.

colinbrislawn · February 15, 2018, 8:23pm

OK, so

250 bp per read x 2 reads = 500 bp for each read
500 bp - 51 bp (or 46) = 449 total coverage
449 - 360 = 90 bp of overlap
   (the area between 110 and 200 on reverse read)

While the quality in that area is not great, that should be plenty of space for dada2 to join and error correct these reads.

Let's see what @benjjneb suggests...

Colin

asr17 · February 15, 2018, 8:32pm

Great explanation @colinbrislawn! and thanks @Mehrbod_Estaki for your suggestion, I did check these additional features, what I noticed all of these extra features occurred just once ( in one sample) !!
Normally, I tend to remove features not occurring at least twice in my samples before conducting any statistical analysis.
Now, I am more inclined to stay with the first scenario ( trunc-len-f 250 and trunc-len-r 199).
Thanks!

benjjneb · February 16, 2018, 1:18am

That looks right to me!

The one ITS-specific worry I might have, is that there is more real biological length variation here, and that perhaps a few long real amplicons are lost with the shorter trimming. But I don't know enough about ITS2 to know if that's the case or not, and for 16S I would follow exactly the reasoning @colinbrislawn laid out.