Losing too many samples when denoising

nietof · November 30, 2021, 12:08am

I am running an analysis on 16S rRNA sequence data from iHMP. I imported paired-end reads for 705 samples using the Casava 1.8 format. Here is the demux.qzv file:
Abeni-demux.qzv (368.2 KB)

When I denoised the samples using DADA2, it kept only 126 of the 705 imported samples. The highest percentage of non-chimeric sequences in any sample is around 54%.
I tried several of the approaches suggested in previous threads on this forum, and the outcome was similar each time. Here is the command I used in my last attempt at denoising the samples:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs Abeni-demux-paired-end.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 291 \
  --p-trunc-len-r 274 \
  --p-max-ee-f 5 \
  --p-max-ee-r 5 \
  --p-min-fold-parent-over-abundance 0.75 \
  --o-table Abeni-table.qza \
  --o-representative-sequences Abeni-rep-seqs.qza \
  --o-denoising-stats Abeni-denoising-stats.qza

Here is the output denoising-stats.qzv Abeni-denoising-stats.qzv (1.2 MB)

In previous threads it was suggested to run only the forward reads to improve the outcome of denoising using DADA2. I just want to make sure I am not missing anything in my dataset that is unusual. I am particularly surprised as these are samples downloaded from iHMP.
Thank you for your help

Fernando

colinbrislawn · November 30, 2021, 3:17am

Hello Fernando,

Thanks for bringing your question to the forums, and including such a detailed description. That's awesome!

(Full disclosure: I worked on the iHMP/HMP2 project about five years ago, and while I'm not one of the PIs or lead authors, I want folks to use these data sets in cool new ways.)

Let's dive in

I think the main issue is the drop in quality in the reverse (R2) reads. I've annotated the truncation lengths you used on the x-axis.

Even with generous maximum expected error thresholds (max-ee), the quality is probably too low for many of these reads to pair, given that q2-dada2 allows no mismatches in the overlap (it calls mergePairs(..., maxMismatch = 0), for now).

Do you know if this is the V4 16S amplicon? If so, you might get better results using much, much shorter truncation settings, so that reads are able to join with exact overlap. Try something like this!

--p-trunc-len-f 200
--p-trunc-len-r 140 # <- or make this even shorter!

If these reads target the V4 region, this should still give plenty of overlap in an area of high quality, resulting in fewer mismatches and more joined reads.

Thanks for using the iHMP data set! Let me know what you find!

nietof · November 30, 2021, 8:20pm

Colin
That is awesome that you worked on the iHMP/HMP. I couldn't get better advice than yours. According to the article published on MOMS-PI, they target the V1-V3 region and the amplicon size is approx. 540 bp. That is why I used the longer truncation lengths, to make sure I had enough overlap.

colinbrislawn · November 30, 2021, 8:44pm

Got it. OK, with maxMismatch = 0, these reads are never going to join.
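Here is a quick back-of-the-envelope check of the overlap. It is a sketch only: it assumes the reads start at the two ends of the amplicon, that the amplicon is about 540 nt, and it ignores length variation between taxa. The function name expected_overlap is just for illustration.

```shell
# Nominal overlap (nt) between truncated forward and reverse reads.
# Assumption: reads start at the amplicon ends; length variation is ignored.
expected_overlap() {
    # $1 = forward truncation length
    # $2 = reverse truncation length
    # $3 = approximate amplicon length
    echo $(( $1 + $2 - $3 ))
}

expected_overlap 291 274 540   # settings from this thread: prints 25
expected_overlap 200 140 540   # V4-style settings: prints -200 (a gap, no overlap)
```

So the truncation lengths you picked leave only about 25 nt of nominal overlap. That is above the 12 nt minimum that DADA2's mergePairs uses by default, but the overlap sits in the low-quality tails of both reads, where a zero-mismatch merge is hard to achieve. The shorter V4-style settings would not overlap at all for this amplicon.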

One option is to join using another program, say vsearch or DADA2 directly in R, where we have more control of our settings and mismatches. Another option is to use just one of the reads, so we don't have to pair at all.
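For the single-read option, something like the following might work. Treat it as a sketch: as far as I know, denoise-single accepts a paired-end artifact and uses only the forward reads, the output file names are placeholders, and the truncation length is something you would pick from the forward-read quality plot. I've wrapped it in a small function (the name is made up) so the truncation length is easy to vary.

```shell
# Sketch: denoise the forward reads only, so no read merging is needed.
# Output file names are placeholders, not files from this thread.
denoise_forward_only() {
    # $1 = demultiplexed artifact (.qza)
    # $2 = forward truncation length, chosen from the quality plot
    qiime dada2 denoise-single \
        --i-demultiplexed-seqs "$1" \
        --p-trim-left 20 \
        --p-trunc-len "$2" \
        --p-max-ee 5 \
        --o-table Abeni-table-fwd.qza \
        --o-representative-sequences Abeni-rep-seqs-fwd.qza \
        --o-denoising-stats Abeni-denoising-stats-fwd.qza
}
```

You would call it as denoise_forward_only Abeni-demux-paired-end.qza 280, where 280 is only an example value. The trade-off is that you keep more reads but cover less of the V1-V3 region.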

I'm not familiar with the MOMS-PI cohort and the data they published... If they published joined reads, you could import those and go from there.

nietof · December 1, 2021, 11:51pm

Colin
If I merge the reads using DADA2 in R, how do I bring the DADA2 output (the table, rep-seqs, and stats) back into QIIME 2 as .qza artifacts so I can continue the analysis there?
I would also appreciate it if you could guide me a bit on which DADA2 parameters I should play with in R. I am going to run their tutorial as well.
Thank you
Fernando

colinbrislawn · December 4, 2021, 10:52pm

Hello Fernando,

Thank you for your patience.

Let's dive back in!

You could export these data types from your R session, then import them as Qiime2 artefacts.
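A rough sketch of the import side, assuming you have already written two files from R: a tab-separated feature table (features as rows, samples as columns, with "#OTU ID" as the first header cell) and a FASTA file whose IDs match the table's feature IDs. The function name and the file names are placeholders I made up for illustration.

```shell
# Sketch: turn files exported from an R/DADA2 session into QIIME 2 artifacts.
# Assumes the TSV has features as rows and a "#OTU ID" header cell, and that
# the FASTA IDs match the feature IDs in the table.
import_dada2_r_outputs() {
    # $1 = feature table (TSV), $2 = representative sequences (FASTA)
    # Convert the TSV to BIOM 2.1 (HDF5) so QIIME 2 can read it.
    biom convert -i "$1" -o table.biom --table-type="OTU table" --to-hdf5
    # Import the feature table.
    qiime tools import \
        --input-path table.biom \
        --type 'FeatureTable[Frequency]' \
        --input-format BIOMV210Format \
        --output-path table.qza
    # Import the representative sequences.
    qiime tools import \
        --input-path "$2" \
        --type 'FeatureData[Sequence]' \
        --output-path rep-seqs.qza
}
```

Called as import_dada2_r_outputs seqtab-nochim.tsv rep-seqs.fna, this should leave you with table.qza and rep-seqs.qza. I have not included the denoising stats here; as far as I know they are only a summary and the downstream steps run from the table and the sequences.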

You could also do all your analysis in R, keeping track of your work using R Markdown. (I've seen people do upstream processing in Qiime2, then export their data so they can do downstream analysis in R. I guess this depends on how much you like R...)

Sure thing! Their tutorials are awesome, but if you have any questions, feel free to open a post in Other Bioinformatics Tools and @ me.

Colin
