Long tail of rare ASVs after dada2 denoise-paired

Hi all,
I am working with QIIME 2 v2024.5, using the dada2 and feature-classifier (BLAST) plugins.
My question is regarding the output table.qza file.
When running these, I get a strong "long tail" effect: out of ~800–1000 ASVs (total across all samples), around 80% are present in only one sample (out of ~20 samples).
I am trying to understand whether dada2 is behaving accurately. As recommended, I am using the default settings. Here is the command:
qiime dada2 denoise-paired --i-demultiplexed-seqs /home/microbiome/Desktop/mice2/mice_small_paired-end-demux.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 240 --p-trunc-len-r 240 --p-n-threads 0 --o-representative-sequences mice_rep-seqs.qza --o-table mice_table.qza --o-denoising-stats mice_dada2.qza --verbose
Thanks for your help.

Hi @Future_Microbiome,
There are a couple of things you can do to fix this, like setting a prevalence threshold: filter-features-conditionally: Filter features from a table based on abundance and prevalence — QIIME 2 2024.5.0 documentation
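A minimal sketch of that filter, using the table name from your command above (the thresholds are illustrative, not a recommendation — this example keeps features at ≥0.1% relative abundance in at least 10% of samples):

```shell
qiime feature-table filter-features-conditionally \
  --i-table mice_table.qza \
  --p-abundance 0.001 \
  --p-prevalence 0.10 \
  --o-filtered-table mice_table-filtered.qza
```

Note that this filter only removes the long tail from the table; it doesn't address whatever is producing those rare ASVs in the first place.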

From what I understand, your question is whether dada2 is working for your data. Have you looked at your dada2-stats.qzv? It will give us insight into which sequences made it past dada2 and whether we need to tweak some parameters.

Could you post your dada2-stats.qzv so I can take a look?
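In case it helps, the denoising-stats artifact from your dada2 command can be turned into a viewable .qzv like this (file names taken from your command above):

```shell
qiime metadata tabulate \
  --m-input-file mice_dada2.qza \
  --o-visualization mice_dada2-stats.qzv
```

You can then drag the .qzv onto view.qiime2.org to inspect it.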


Hi @cherman2 , thanks for your prompt reply.
You are right, I want to understand whether dada2 is working well on my data sets (it's not just one set; it's multiple projects involving soil, sludge, and animal secretion samples).
I am attaching the dada2 output file.
Mice_full_dada2.qza (12.1 KB)
Thanks a lot

Hi @Future_Microbiome,
Looks like you are losing a decent number of your sequences at the merging stage, and even more :scream: at the chimera-filtering step.

The merging issue is easier to fix than the chimera detection step. Chimera detection is hard to play around with, because we don't want to relax its parameters and let real chimeras into our data! :stop_sign:

So let's look at getting those merging percentages up!
What region of the 16S amplicon are you using?
Would you mind sending over your demux.qzv so I can look at the quality plots?
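If you don't already have one, the demux.qzv with the quality plots can be generated from your demultiplexed artifact (input name taken from your earlier dada2 command):

```shell
qiime demux summarize \
  --i-data mice_small_paired-end-demux.qza \
  --o-visualization mice_demux.qzv
```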


Hi @cherman2,
I am using the universal 16S primers 515F–806R.
I am attaching the demux.qzv file.
I'll just note that the raw FASTQ data I get from the sequencing provider is not of the highest quality. According to their OTU report (they also provide an OTU table generated with the CLC software, which I don't use), more than 50% of their sequence reads are discarded for chimera and quality-filtering reasons.
Mice_full_paired_end.qzv (311.6 KB)

Hi @Future_Microbiome,
Yes, you are right that the quality isn't stellar here.
Could you try dada2 with --p-trunc-len-f 250 --p-trunc-len-r 250, then send the dada2 stats for that run?

I am not sure if we will be able to get more sequences past filtering, but it might give us a better shot at merging the sequences that do pass filtering.
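For reference, a back-of-the-envelope overlap check for this region (assuming a ~253 bp V4 amplicon after primer removal — the exact length varies by taxon):

```shell
# Paired reads merge only if the truncated reads still overlap.
# DADA2 requires at least 12 bp of overlap by default.
amplicon=253   # approximate 515F-806R amplicon length, primers removed
trunc_f=250
trunc_r=250
overlap=$((trunc_f + trunc_r - amplicon))
echo "overlap: ${overlap} bp"
```

With this amplicon even fairly aggressive truncation leaves plenty of overlap, which is why shorter truncation lengths are usually safe for V4 data.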

A lot of problem solving with dada2 is just fiddling with parameters until we feel we have the best-quality sequences we can get. Sorry if this gets repetitive as we debug!


Hi @cherman2 .
Attached are both files, 250 and 200. It seems that the shorter the truncation, the better the filtering. What should we usually expect to see in a good run?
Mice_full_250_dada2.qza (12.1 KB)
stats.tsv (1.1 KB)
Thanks a lot for your help.

Hi @Future_Microbiome,

Sorry, this turned into a long answer!

Basically, when there is too much noise in the data (i.e., low-quality data), dada2 starts throwing away reads because they are too noisy. When we truncate, we typically cut out the noisier part of the reads (quality usually decreases toward the end of the read), allowing more reads to make it through dada2's denoising steps. :tada:

Your 200 trunc run looks the best to me: you keep the most sequences through filtering and merging at that truncation length. That said, this is your data, so you should choose the run that makes the most sense for it.

Unfortunately, it looks like no matter what you do, you are losing a lot of reads to chimera filtering. There are some parameters you can tweak to try to get more reads past this step. I am always a little wary :fearful: of changing the default chimera-filtering parameters, because we don't want to relax our thresholds too much and end up with chimeras in our data :-1:.

Here is a good forum post on chimera filtering, if you want to try tweaking those parameters: high chimera rate in dada2 - #4 by Nicholas_Bokulich :smile:
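A hedged example of what such a tweak looks like, reusing the input name from the original command (the value 4.0 and the output names are illustrative, not recommendations): raising --p-min-fold-parent-over-abundance from its default of 1.0 requires candidate parent sequences to be that many times more abundant before a read is flagged as chimeric, so fewer reads are discarded.

```shell
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs mice_small_paired-end-demux.qza \
  --p-trunc-len-f 200 \
  --p-trunc-len-r 200 \
  --p-min-fold-parent-over-abundance 4.0 \
  --o-representative-sequences mice_rep-seqs-chim4.qza \
  --o-table mice_table-chim4.qza \
  --o-denoising-stats mice_dada2-chim4.qza
```

If you do try this, compare the resulting denoising stats against the default run before trusting the output.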

My rule of thumb is that if all samples lose more than 50% of their reads at any step (filtering, denoising, merging, or chimera detection), I look at changing parameters to try to get a better result. For your data, the chimera detection step is losing ~50% in each sample. But sometimes that's just the data; it isn't always perfect :person_shrugging:

Circling back to your original question :question: :

I think that you are using dada2 correctly, and that unfortunately the quality of the sequences is leading to less-than-stellar results from dada2. For continuing on with this data, I would recommend looking at setting a prevalence threshold.

I hope all this helps! Let me know if you have any more questions or need clarifications on anything!



Dear @cherman2, thanks for your elaborated explanation.
I agree that changing the filtering parameters is the less preferable solution. For now, I will contact the sequencing provider and discuss the issue with them, to understand their exact library preparation and sequencing procedures.
Thanks again