DADA2 trimming and truncation parameters and subsequent feature count

Hello,

I am currently analyzing 16S V1-V3 skin microbiome sequencing data using QIIME 2 (2022.2). The primers used for sequencing were 27F and 534R with a 2x300 bp paired-end protocol.

Calculating overlap for trim/trunc parameters:
My amplicon size is 534 - 27 = 507 bp
Overlap would be (300 * 2) - 507 = 93 bp

If my understanding is correct, I can truncate up to 93 bp (combined) from my sequences, not including the 12 bp minimum overlap DADA2 requires and ~20 bp for length variation.
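To sanity-check that arithmetic, here is a minimal sketch using the numbers from this post (the ~20 bp cushion is just the rough allowance for length variation mentioned above):

```python
# Rough merge-overlap budget for 2x300 bp reads over a 27F/534R amplicon.
READ_LEN = 300            # per-read length
AMPLICON = 534 - 27       # nominal amplicon size, 507 bp (as in the post)
MIN_OVERLAP = 12          # minimum overlap DADA2 requires for merging
LENGTH_VARIATION = 20     # rough cushion for natural length variation

overlap = 2 * READ_LEN - AMPLICON                         # 93 bp
trunc_budget = overlap - MIN_OVERLAP - LENGTH_VARIATION   # ~61 bp combined
print(overlap, trunc_budget)  # 93 61
```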

My confusion lies in the feature counts I get from my tables after DADA2. I have tried truncation parameters ranging from keeping all 300 bp down to truncating to 200 for both strands. At certain settings I am still detecting many features despite being outside the range of overlap (assuming I calculated the overlap correctly, or that my understanding of overlap is correct, which could be my primary issue).

As expected, at a truncation of 200 for both strands, I have 3 features remaining across my 124 samples with a total frequency of 6.

At truncation 275 for both strands (50bp truncation), I have 1097 features with a total frequency of 785K across 124 samples.

However, at truncation 240 for both strands (120bp truncation), I have 1992 features with a total frequency of 2374K across 124 samples.

Shouldn't 240 for both strands be outside the range of my overlap? Why am I getting even more features (both count and frequency) despite being out of that range?

Again, it's very possible I am misunderstanding or miscalculating my overlap.

Thank you for your time and help.

Best,
Daniel


Hi @dann818,

Welcome back to the :qiime2: forum!

Great question! There can be natural variation in amplicon lengths across taxa - which could be causing your shorter truncation lengths to pick up more features than you are expecting.
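To make that concrete, here is a small sketch (assuming DADA2's 12 bp minimum merge overlap and the nominal 507 bp amplicon from the post) of the longest amplicon that can still merge at each truncation setting:

```python
MIN_OVERLAP = 12  # minimum overlap DADA2 requires for merging


def max_mergeable_amplicon(trunc_f, trunc_r, min_overlap=MIN_OVERLAP):
    """Longest amplicon that can still merge after truncating the
    forward and reverse reads to trunc_f and trunc_r bases."""
    return trunc_f + trunc_r - min_overlap


# At 240/240 the nominal 507 bp amplicon cannot merge (468 < 507)...
print(max_mergeable_amplicon(240, 240))  # 468
# ...but any naturally shorter V1-V3 amplicon (<= 468 bp) still can,
# which is one way shorter truncation can yield *more* features.
print(max_mergeable_amplicon(275, 275))  # 538 -> full-length merges
```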

A couple of next steps I'd recommend:

  1. Take a look at the dada2 stats for more insight into why your reads are being filtered out. If you don't mind sending those over (either here or in a DM), we can take a look at them as well and see if anything stands out!
  2. Try running dada2 just using your forward reads (i.e. single-end) and see how many reads you get from that.

This is a bit of a balancing game of trying to figure out how to maximize your read length and the number of reads you're getting back.

Hope this helps! Cheers :lizard:


Hello @lizgehret,

Thank you for your time and explanation. I think this is beginning to make sense. I will prepare some stats and DM them over if you wouldn't mind taking a peek whenever you have time!

Given the high number of reads for my shorter truncation lengths, is there a chance that the features detected are more erroneous than the ones at a longer truncation length (e.g. 240 vs. 275, 240 has more features but they could be due to an error)?

EDIT: Perhaps another way to word it would be, as a general rule (if there even is one!), is the more reads that pass the DADA2 denoising process, the better?


Hi @dann818,

Thanks for sending those stats over!

The short answer to this is yes - and the long answer is that the features you're detecting could either be completely erroneous, or they could be actual variable length features. Both of these outcomes are sub-optimal because this isn't an accurate representation of your region of interest.

Yes and no - of course you do want more reads, but you also want to have confidence that the reads passing the filter are reasonable for your region of interest. It's a bit of a balancing act between retaining reads and making sure what you're retaining is what you actually want to analyze.

After looking at your stats, 250 looked to be a pretty reasonable truncation length - you're still getting above 50% passing the filter, and a similar percentage of successfully merged reads. However, given that 250 + 250 = 500 bp is shorter than your 507 bp amplicon, you are probably still selecting for reads that are shorter in length, so I would not use 250 as a truncation length. Although reads truncated at 275 should be able to merge, a lot of the sequences get thrown out because of low sequence quality. I think there are three next steps:

  1. You may consider taking a look at the stats at lengths in between 250 - 275 to see where things start to go south (since 275 has a very low percentage of both reads passing the filter and reads successfully merging).
  2. You may want to try truncating at 275 but adjusting the max-ee parameter (the maximum expected errors) to see if that stops the sequences from getting filtered out due to low quality.
  3. Lastly, I would still recommend running single-end (forward reads only) to see how many reads you get, and use that as your baseline for what is possible as you adjust the parameters.
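For step 1, a quick scan of candidate truncation lengths can show where full-length amplicons stop merging (a sketch assuming the nominal 507 bp amplicon and a 12 bp minimum overlap; the real cutoff in your data will also depend on read quality):

```python
AMPLICON = 507    # nominal 27F/534R amplicon length from the post
MIN_OVERLAP = 12  # minimum overlap DADA2 requires for merging

# Scan equal forward/reverse truncation lengths between 250 and 275.
for trunc in range(250, 280, 5):
    overlap = 2 * trunc - AMPLICON
    status = "merges" if overlap >= MIN_OVERLAP else "too short"
    print(f"trunc {trunc}/{trunc}: overlap {overlap:+d} bp -> {status}")
# 250/250 and 255/255 leave too little overlap for the full-length
# amplicon; 260/260 and above can merge it (quality permitting).
```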

I hope this helps!

Cheers :lizard:
