Hi experts,
I am trying to figure out the best settings in DADA2 on my MiSeq PE reads targeting ITS2. In a previous post I found it very helpful, the focus was mainly on the number of features in setting trunc-len DADA2, truncation lengths and features number
But what about total frequencies? on a subset of my data here is what I got:
using trunc-len-f 250 and trunc-len-r 199: total Features=788 and total Frequencies=170,363
using trunc-len-f 250 and truc-len-r 204 : total features= 804 and total frequencies= 168,588
Which truncation to choose from? One might say go with the higher features but losing sequence counts might affect my low abundance samples.
I appreciate your inputs,
Thanks!
Great question. Can you also post the quality score plots for the forward and reverse reads? That’s will help me understand what’s in those last 5 bp you are trimming off.
My first through it that 5 bp would only make a small difference, and that seems to be true: a change of 16 ASVs and around 1% of your reads is pretty minor. I think you could safely choose either of these results!
This is ITS2 region with an amplicon of 340-360bp, after trimming forward and reverse primers I should get roughly ~320 bp in DADA2. In both truncation scenarios, my reads should have enough overlap to be joined.
Yes let’s hear what others have to say.
Thanks for your input!
My 2 cents. As @colinbrislawn mentioned I think you can choose either of those cases because they produced negligible (in my opinion) differences. I’m guessing the bulk, if not all, of the increase in your features # probably comes from singletons or features with very low frequencies in a handful of samples only. You can check this for yourself. These are typically removed when you are doing differential abundances analyses anyways as they don’t really provide much information and increase noise, so it won’t affect anything there, and in indexes where they drive alpha-diversity changes see this great discussion about why perhaps those are incorrect to begin with.
Ultimately you can look at the sequencing depth of your samples between the two cases and see if one allows you to retain more samples. In that case, I would stick with that option.
Edit: After re-reading the discussion I linked myself, I realize perhaps most of those feature increase are not singletons per se, as DADA2 tries to avoid those, but I think they would still fall within the realm of very low frequencies within very few samples.
250 bp per read x 2 reads = 500 bp for each read
500 bp - 51 bp (or 46) = 449 total coverage
449 - 360 = 90 bp of overlap
(the area between 110 and 200 on reverse read)
While the quality in that area is not great, that should be plenty of space for dada2 to join and error correct these reads.
Great explanation @colinbrislawn! and thanks @Mehrbod_Estaki for your suggestion, I did check these additional features, what I noticed all of these extra features occurred just once ( in one sample) !!
Normally, I tend to remove features not occurring at least twice in my samples before conducting any statistical analysis.
Now, I am more inclined to stay with the first scenario ( trunc-len-f 250 and trunc-len-r 199).
Thanks!
The one ITS-specific worry I might have, is that there is more real biological length variation here, and that perhaps a few long real amplicons are lost with the shorter trimming. But I don't know enough about ITS2 to know if that's the case or not, and for 16S I would follow exactly the reasoning @colinbrislawn laid out.