Is there a "too short" truncation length in dada2?

Hello,

I know there is an abundance of threads on the topic of truncation length in dada2, but I still couldn't answer my question after thorough reading.

This is my quality plot for an Illumina MiSeq V3-V4 run. Due to the poor quality, I opted to only look at the forward reads (there is not enough high-quality overlap for merging). So I tried different truncation settings in dada2 to maximize my output.
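
For reference, a rough sketch of how such a comparison can be run with the QIIME 2 CLI, assuming single-end (forward-read) input; the file names here are placeholders:

    # Run dada2 denoise-single at several truncation lengths and compare the
    # denoising stats afterwards (demux.qza is a placeholder artifact name).
    for LEN in 30 50 100; do
      qiime dada2 denoise-single \
        --i-demultiplexed-seqs demux.qza \
        --p-trunc-len ${LEN} \
        --o-table table-${LEN}.qza \
        --o-representative-sequences rep-seqs-${LEN}.qza \
        --o-denoising-stats stats-${LEN}.qza
    done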

I plotted the different results here side by side. According to this, I would choose --p-trunc-len 50 as my go-to setting (length 50 yields slightly more non-chimeric reads than length 30 or length 100).
Can such a short input lead to problems in the downstream analyses? Thank you in advance!

Hello Jonas,

Thank you for doing a thorough lit review! Lots of the DADA2 trimming questions are about how to make the reads merge, so your question is unique.

In your case, the trade-off is between getting more reads into ASVs and getting longer ASVs.

A higher number of reads increases sensitivity to changes in alpha and beta diversity and gives you more power for differential abundance testing. But it sounds like you already know that keeping more reads is important.

Longer ASVs let you use more of the V3 and V4 region to differentiate similar microbes. This is probably why V3-V4, instead of simply V3, was chosen for amplification. The 150 bp paired-end kits are also cheaper, yet people are willing to pay for the 300 bp paired-end kits to get longer amplicons and better resolve similar microbes.

Longer reads also provide more detail for taxonomic classification. Moving from 100 to 200 bp total gives the taxonomic classifier twice as much information to work with when it searches the database.


This is a perfect use case for positive controls with a known composition. Did you include any of these on your runs?

Hello Colin,

Thank you very much for your reply. In my analysis I'm more interested in observing changes in alpha/beta diversity than in the most fine-grained differentiation of each and every subtype of S. aureus (for example). So with your explanation, I would try to maximize the number of reads in my case by going with a truncation length of 50.

No, positive controls were unfortunately not included in this sequencing run, but they will be in the next runs. I know, it's not ideal and should have been done.
This was a one-off run because someone had some space left on an Illumina run. Next time I'll make sure to integrate controls.

Hi!
I hope it's OK if I join the discussion. There is another factor that is important for alpha and beta diversity metrics: the number of unique ASVs. The shorter the reads you work with, the lower that number will be. Based on your graph, there is no big difference between lengths 50 and 100, so 100 would definitely be my choice when deciding between these two numbers.
However, I would also try to go for even longer reads. Did you try the following?

  • Remove primers with cutadapt before dada2 (it looks like your primers are still there)
  • Discard all the reads without primers (there is such option in the Cutadapt plugin)
  • Set truncation to ~200 (check the quality plot again after cutadapt)
  • Increase the --p-max-ee value

This approach will:

  • Reduce false variability between ASVs
  • Provide you with longer reads, which is essential for a better estimation of alpha/beta diversity metrics
  • Recover more ASVs that were lost due to the truncation threshold and the poor quality of the reads.

To summarize, my minimum would be 150 and my maximum around 200; a sketch of these steps on the command line is below.
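
For future readers, a rough sketch of what these steps could look like with the QIIME 2 CLI, assuming single-end (forward-read) input; the artifact names and the --p-max-ee value are placeholders, not a definitive recipe:

    # Trim the forward primer with the cutadapt plugin and discard reads
    # in which the primer is not found (artifact names are placeholders).
    qiime cutadapt trim-single \
      --i-demultiplexed-sequences demux.qza \
      --p-front CCTACGGGNGGCWGCAG \
      --p-discard-untrimmed \
      --o-trimmed-sequences demux-trimmed.qza

    # Re-check the quality profile after primer removal.
    qiime demux summarize \
      --i-data demux-trimmed.qza \
      --o-visualization demux-trimmed.qzv

    # Denoise with a longer truncation length and a relaxed max-ee
    # (~200 as suggested above; --p-max-ee 4 is just an example of a relaxed value).
    qiime dada2 denoise-single \
      --i-demultiplexed-seqs demux-trimmed.qza \
      --p-trunc-len 200 \
      --p-max-ee 4 \
      --o-table table-200.qza \
      --o-representative-sequences rep-seqs-200.qza \
      --o-denoising-stats stats-200.qza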

Best,

Hello Timur,

thanks for your answer! It's very much appreciated.

  • Remove primers with cutadapt before dada2 (it looks like your primers are still there)

I think you're referring to the steep decline at the beginning of the forward reads. I manually checked about 10 fastq files for the presence of the primers (341F CCTACGGGNGGCWGCAG (17 nt) and 805R GACTACHVGGGTATCTAATCC) and I didn't find them (see the screenshot below).

But to double-check, I ran cutadapt with these primers and there were indeed some primers to be trimmed.

Where does cutadapt find these primers when I don't find them manually?

The quality plot after cutadapt looks better than the one in my opening post. So I want to believe cutadapt, but I don't understand where it finds the primers to remove when I don't see them myself in the sequences.

(This may be a bit of a cutadapt side quest, so maybe it should be split into a new topic?)

Thank you for your time and replies!

You are right, the topic has expanded a little bit, but I think we are still trying to solve your original issue, so I would prefer not to split it unless you run into completely new issues. At least for now.

You probably can't find the primers because they are degenerate: the sequence contains ambiguity codes such as N instead of a fixed A, T, C, or G. Cutadapt knows how to handle this properly. If you are curious, you can try replacing them with nucleotides of your choice while searching in a text editor to see whether some sequences contain the primer.
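
A quick way to see this from the command line is to expand the ambiguity codes into character classes before searching (a sketch; the file name is a placeholder and the counts depend on your data):

    # Searching for the literal degenerate primer string usually finds nothing,
    # because the reads contain real bases where the primer has N and W.
    zcat sample_R1.fastq.gz | grep -c 'CCTACGGGNGGCWGCAG'

    # Expanding N -> [ACGT] and W -> [AT] reveals the actual matches.
    zcat sample_R1.fastq.gz | grep -c 'CCTACGGG[ACGT]GGC[AT]GCAG'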

Hello,

to round things off (maybe for future readers), I followed your ideas, @timanix.
After using Cutadapt, the graph looks like this:

Since there is no significant difference between 100 and 150, I will opt for 150 in order not to lose ASVs.

Thank you all for your replies!

Thank you for the updates! I was curious about how you are going to proceed.

If I interpret the graph correctly, with a length of 200 only one sample has a read count below 100K. If that is the case, I would even go for 200, since I prefer longer reads over a higher overall count (if you still have enough of them, of course!). For me, a median count of 200K at length 200 looks better than 250K at length 150.
But it is only my opinion and it is up to you to decide how to proceed. Anyway, good luck (oder Viel Glück) with downstream analyses!

Best,

Hi Timur,

thanks! One sample is always below a read count of 100K, no matter the truncation length.

I'm not sure if I understand you correctly:

a median count of 200K at length 200 looks better than 250K at length 150

But at length 200 the median count already drops below 200k (183147)?

I was too lazy to look into the tables, so I just estimated based on the graph you provided. So for me, 183K at length 200 is still better than 228K at length 150.

Thank you! I will do the diversity analyses with both lengths and compare potential differences! Thanks a lot!
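
For future readers, a minimal sketch of how such a side-by-side comparison could look with the QIIME 2 CLI (the feature-table names, metadata file, and sampling depth are placeholders, not values from this thread):

    # Compute core diversity metrics for each truncation length and then
    # compare the resulting alpha/beta diversity outputs between the two runs.
    for LEN in 150 200; do
      qiime diversity core-metrics \
        --i-table table-${LEN}.qza \
        --p-sampling-depth 10000 \
        --m-metadata-file metadata.tsv \
        --output-dir core-metrics-${LEN}
    done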