Up to 97% of sequences being removed by filtering/denoising in DADA2

nutrishinn · December 19, 2019, 9:16pm

I ran QIIME2 2018.4 and 2018.6 in September 2018 on MiSeq single-end sequencing data and realized that a large number of my sequences were being removed by DADA2, with one sample being cut down from ~3500 reads to 631. Based on this discussion post, I thought the --p-min-fold-parent-over-abundance parameter might be responsible, so I spent some time re-running my data with it set to 2, 4, 6, and 8. My code for this step was as follows:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-food5.qza \
  --p-n-threads 24 \
  --p-trim-left 19 \
  --p-trunc-len 271 \
  --p-trunc-q 20 \
  --p-min-fold-parent-over-abundance 8 \
  --o-representative-sequences rep-seqs-dada2-food5_parent8.qza \
  --o-table table-dada2-food5_parent8.qza \
  --o-denoising-stats stats-dada2-food5_parent8.qza

qiime metadata tabulate \
  --m-input-file stats-dada2-food5_parent8.qza \
  --o-visualization stats-dada2-food5_parent8.qzv

We set --p-trunc-q 20, but all of the quality scores are above 20, so that shouldn't be dropping anything. Additionally, the --p-trim-left 19 and --p-trunc-len 271 parameters are set to remove primer sequences we know are present, as our sequences should be 250 bp.
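One thing that may be worth checking here: as far as I understand the DADA2 filter, --p-trunc-q truncates each read at its first base with a quality score less than or equal to the threshold, and a read that ends up shorter than --p-trunc-len is then discarded. So a plot whose summary scores sit above 20 does not guarantee that every individual read does. The function below is only a toy illustration of that rule, not DADA2's actual code; it ignores --p-trim-left, and the function name and quality scores are made up.

```shell
# Simplified sketch of the truncation rule (not DADA2's real implementation).
# Usage: read_survives TRUNC_Q TRUNC_LEN Q1 Q2 Q3 ...
# Prints "kept" or "discarded" for one read given its per-base quality scores.
read_survives() {
  trunc_q=$1
  trunc_len=$2
  shift 2
  kept_len=0
  for q in "$@"; do
    # Truncate at the first base whose quality is <= trunc_q.
    if [ "$q" -le "$trunc_q" ]; then
      break
    fi
    kept_len=$((kept_len + 1))
  done
  # A read shorter than trunc_len after truncation is dropped entirely.
  if [ "$kept_len" -ge "$trunc_len" ]; then
    echo kept
  else
    echo discarded
  fi
}

# Toy 6-base reads with trunc-q 20 and trunc-len 5 (made-up scores).
read_survives 20 5 38 37 36 35 30 12   # low base only after position 5
read_survives 20 5 38 37 18 35 30 30   # a single low base at position 3
```

The first toy read is kept and the second is discarded, even though most of its bases are high quality.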

However, after completing these additional analyses, I discovered that the chimera removal was not removing as many sequences as the filtering and denoising step in DADA2.

The data come from 5 separate clinical trials examining 5 separate foods, which we are merging together for our purposes. The input and the subsequent columns in the table are from one sample per food, which I selected as examples. You can see in the table below that Food 5, for example, had 97% of its sequences removed by the filtering and denoising steps of DADA2, regardless of what the --p-min-fold-parent-over-abundance parameter was set to. This leads me to believe that this parameter is not affecting the removal of sequences as much as the filtering and denoising steps are, and I'm wondering why?
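To see which step loses the reads, the denoising stats can be exported to a TSV and summarized per sample. The sketch below assumes the export contains a header row with columns named input, filtered, denoised, and non-chimeric, plus a "#q2:types" line; the function name, file name, and numbers are invented for illustration and are not the real Food 5 values.

```shell
# Assumed export step (flags differ between QIIME 2 releases, so check
# `qiime tools export --help` for your version before using it):
#   qiime tools export --input-path stats-dada2-food5_parent8.qza \
#     --output-path stats-export

# Usage: step_losses STATS_TSV
# Prints, per sample, the percentage of input reads lost at each step.
step_losses() {
  awk -F '\t' '
    NR == 1 {
      # Locate the columns by header name rather than by position.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    /^#/ { next }   # skip the "#q2:types" directive line
    {
      input = $(col["input"])
      if (input == 0) next
      printf "%s\tfilter=%.1f%%\tdenoise=%.1f%%\tchimera=%.1f%%\n", $1,
        100 * (input - $(col["filtered"])) / input,
        100 * ($(col["filtered"]) - $(col["denoised"])) / input,
        100 * ($(col["denoised"]) - $(col["non-chimeric"])) / input
    }
  ' "$1"
}

# Made-up numbers purely to exercise the function.
printf 'sample-id\tinput\tfiltered\tdenoised\tnon-chimeric\n' > example-stats.tsv
printf '#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\n' >> example-stats.tsv
printf 'toy-sample\t1000\t200\t180\t150\n' >> example-stats.tsv
step_losses example-stats.tsv
```

Each percentage is expressed relative to the input count, so the three values add up to the total fraction of reads removed for that sample.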

colinbrislawn · December 19, 2019, 9:33pm

Good afternoon Leila,

Welcome back to the forums!

I think you are on the right track:

A huge number of your reads are getting filtered out, which is unexpected... but might also be a good thing! I know I would rather have a smaller amount of perfect data than lots of low-quality data.

If we look at the denoising-stats.qzv visualization from the Atacama Soils Tutorial, you will see that about 80-90% of their data passes the DADA2 filter. And they get these high numbers after carefully choosing the left and right trim and trunc parameters for their reads.

I bet with the correct read trimming settings, you will get more of your reads to merge and pass filter for all food types!

Can you post your demux.qzv that has the interactive quality plots? We can use those to pick out good trimming settings!

Colin

P.S. Edit:

"Additionally, the --p-trim-left 19 and --p-trunc-len 271 parameters are set to remove primer sequences we know are present, as our sequences should be 250 bp."

We should also set them to remove parts of the read that are too low quality to merge well.

nutrishinn · December 20, 2019, 9:17pm

Great, thanks for validating my thoughts @colinbrislawn! Your explanation is very clear. Thank you!

I've attached the demux.qzv file demux_Food5.qzv (288.5 KB) that was the most concerning (Food 5's), but I can share the others as well if that would help, just let me know! Thanks again for your help!

Leila

colinbrislawn · December 20, 2019, 9:25pm

Hey Leila,

I took a look at the plot and focused on where you trimmed at --p-trunc-len 271. Here's the quality I saw there:

That median (50th percentile) quality score of 22 is not great! What if you tried --p-trunc-len 230 to see how many of your reads pass the filter then?

I think there is going to be an essential tradeoff between

longer, noisy reads (many of which get filtered out) or
shorter, high-quality reads (almost all of which get through filtering)

That choice is up to you! Let me know what other settings you try after trying 230!
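A small loop makes it easy to compare several truncation lengths side by side. This is only a sketch: the helper name and output file names are hypothetical, 250 is an arbitrary in-between value, and the remaining flags are copied from the original command (with --p-min-fold-parent-over-abundance left at its default). The helper prints each command instead of running it, so it can be reviewed first.

```shell
# Print (not run) a denoise-single command for one truncation length.
# Output file names are hypothetical; adjust them to your own naming scheme.
dada2_cmd() {
  len=$1
  echo "qiime dada2 denoise-single" \
    "--i-demultiplexed-seqs demux-food5.qza" \
    "--p-n-threads 24" \
    "--p-trim-left 19" \
    "--p-trunc-len ${len}" \
    "--p-trunc-q 20" \
    "--o-representative-sequences rep-seqs-dada2-food5_trunc${len}.qza" \
    "--o-table table-dada2-food5_trunc${len}.qza" \
    "--o-denoising-stats stats-dada2-food5_trunc${len}.qza"
}

# 271 is the original setting and 230 is the suggested one.
for candidate in 271 250 230; do
  dada2_cmd "$candidate"
  # To actually run it (requires an activated QIIME 2 environment):
  # eval "$(dada2_cmd "$candidate")"
done
```

Tabulating each resulting stats file then shows how the fraction of reads passing the filter changes with the truncation length.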

Colin

nutrishinn · December 20, 2019, 9:45pm

Got it, thanks so much @colinbrislawn!

I'm going to have to play around with this more after the holidays. I promised myself I'd take some time away from technology over the coming weeks to focus on relaxation and spending time with friends and family. So I will let you know in the first couple weeks of January!

Hope you get some relaxation time as well!
