dada2 truncation experiment discussion

Labm · April 26, 2022, 2:48am

Hi all,

I have trawled through the truncation posts of this forum again (in addition to my own which I made previously) and I was hoping for another discussion.. (I understand this topic has been beaten to death at this point, and perhaps I am a bit stupid because my results don't seem to match up with the expected - but please humour me so I can move on with my life!)

I have run a series of dada2 experiments using different truncation lengths on my full dataset.
2x300 bp V3-V4 expected size (according to the papers and other forum posts) is ~464 bp.

These were my results (I am providing output from only one sample, but the trend was similar for all samples). The calculation for the overlap I used is -> (300 + 300 - 464 - trun bp F - trun bp R = overlap remaining). Previous posts state we need 20 bp overlap minimum, but I understand that 12 bp is fine now.

What I am having trouble wrapping my head around is why I am not losing significantly more reads at the merging step when I have a predicted -19 bp overlap (i.e. no overlap using the 248/197 test for example). Then, on the other hand, why am I not gaining significantly more reads when I choose parameters (such as the 265 and 220 test) that provide a sufficient overlap (21 bp in this case).

Sure, the 16s rRNA gene varies in length but I would expect that to be reflected in my results. From previous experiments I know that once I choose a much smaller truncation value I lose all the reads. But here be it a -19 overlap or a +21... there is only 3% difference in final output (with the minus overlap test providing the highest output!).

Instead of spending hours pondering your amplicon length and quality scores, is it better to just run a series of experiments to see what trunc. lengths provide you the most sequences (after filtering, merging, and the chimeric removal). Obviously in my case, it seems logical to select the parameters with a -19 bp overlap because this data provides me with the highest output, with the highest median Q values (but also it makes no sense why and hence, I am hesitant that I am biasing my dataset).

Main Question: What have I gotten wrong/missed? Is it due to the calculation I performed? I haven't taken into account that I used cutadapt so my forward and reverse reads are actually 283 and 280, respectively. Is this the problem?

llenzi · April 26, 2022, 9:04am

Hi @Labm,
thank you so much to report this!. I am not sure what is happening but I will try to give my 2 cents.
I agree that in the way you put, this look odd! What I would like to ask you is if you went further in the analysis of these denoised reads, do the taxonomic assignment for all these look similar?
Do you have positive control in these data?
I would be worried that, although if you get 'more' merged reads by trimming more, these merged are due to spurious merging, allowed by the mismatch threshold allowed by the default settings in the overlap region. Given your high quality sequences (because I guess in tour sequences after trimming the quality is very high), did you try to set for not allowing any mismatch at all? I would expect your merging ratio would drop in this case, giving results in line with what you expect in your initial though.

Cheers
Luca

Labm · April 26, 2022, 9:38am

Thanks Luca,

I am in the process of moving forward both the 265/220 run and the 248/197 run (will post results once done). I have positive controls and will look to see if the taxa expected are present once done.

This is my worst quality run so although the medians are around 30, in some cases it drops in the reads proceeding that point. 248/197 was chosen as those are the reads before ANY quality drop occurs. In other cases, when I chose higher truncation numbers, as mentioned, sometimes it drops slightly then increases again.

To clarify, what is the additional command to remove spurious merging? Is the below correct
--p-max-ee-f 0
--p-max-ee-r 0

EDIT

I just want to clarify the original calculation I used to calculate the overlap -> I tried drawing out the problem so I could visualise it better and came to a different conclusion when thinking about the impact of the primers.

Here are the sequence stats comparing 265_220 vs 265_190

Here is the taxonomy barplot results from one of the positive controls. It is the commercial Zymo kit, hence I replaced the ASVs with the expected species. I compared the 265 220 test to the 265 190 test. Little difference imo.

Edit #2

I also looked at all the taxa detected between the two sets of data (via the taxa labels across the barplot .csv). The 265_220 data had 5 taxa present in at least 1 sample that were not present in the 265_190 data. The 265_190 data and 25 taxa present in at least 1 sample that were not present in the 265_220 data.

I still have no idea which truncation length is "better". Is the 265_190 producing spurious merges (hence the 25 additional taxa?) Is the 265_220 data identifying longer reads that were discarded in the other run? perhaps...

llenzi · May 9, 2022, 4:31pm

Hi @Labm,
I am sorry to take so long to answer you!
I am glad to see that the two different trimming options are resulting in very similar profiles!
For this reason, I would suggest you to use 265_220, the only reason being that a negative overlap may be questioned by reviewers at publication time, while a more usual positive number may be more easily accepted!
Unless in the meantime you have further evidence on differences impacting on the identified bacteria.
Did you identified the read belonging to the 25 taxa in the 265_190 set? Do they look real by blasting them?
Cheers
Luca

system · June 9, 2022, 10:31pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.