DADA2 statistics - fewer input reads than expected

Hi everyone,

I trimmed my paired-end data to keep only the first 200k reads using BBDuk (the image below shows 100k each for R1 and R2), and then used DADA2 to denoise.

Then I checked the statistics produced by the pipeline, and it reported only 100k input reads for one sample instead of 200k, as shown below.
dada2-stats.qzv (1.2 MB)

If QIIME2 already merged my data, shouldn't it be 200k reads?
Or am I misreading the column in dada2-stats.qzv?

Thank you in advance.

Welcome to the forum, @afinaa!
I haven’t used bbduk, but I’m going to give your question a shot anyway. Maybe we can learn something together. :slight_smile:

If I’m reading your image and question correctly, you are using bbduk to subsample your sequences to an even depth, preserving 100k reads of length n from each file of sequences. Based on the file names, this looks like paired-end Illumina data. Does that sound right?

If so, that means you initially imported 20 files into QIIME 2, where 10 contained 100k forward reads (R1), and 10 contained 100k reverse reads (R2). When you ran dada2 denoise-paired, it joined the forward and reverse reads, so that each of those 100k forward reads was combined with one of the 100k reverse reads, leaving 100k joined reads per sample.
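If you want to confirm the counts yourself, here is a minimal shell sketch. It assumes gzip-compressed FASTQ files with four lines per record; the function name `count_reads` and the file names in the comments are placeholders of mine, not part of QIIME 2 or BBDuk.

```shell
# Count the records in a gzip-compressed FASTQ file.
# Assumes the standard layout of exactly 4 lines per record.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical file names; substitute your own:
# count_reads Sample1_R1.fastq.gz   # forward reads
# count_reads Sample1_R2.fastq.gz   # reverse reads
```

If both files report 100000, the sample holds 100k read pairs, and each pair counts as one input read for denoise-paired.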

The big-picture questions I have for you are mostly about how BBDuk does its subsampling, and what it’s doing for you that isn’t possible in the commands you’re running with QIIME 2. To be clear, I am not suggesting you shouldn’t use BBDuk, or that QIIME 2 is everything to everyone. Rather, I’m wondering whether you’re duplicating your efforts unnecessarily, and possibly opening yourself up to errors in analysis along the way.

For example, the fact that you’re only retaining 6-9% of your reads after DADA2 raises red flags for me, and I’m wondering whether DADA2 is getting forward/reverse read pairs that don’t actually match. :face_with_monocle: Is BBDuk selecting matched reads from your forward and reverse samples, or is it treating each file as an independent sample? I have no idea whether this is happening and causing data loss, but I think it’s worth investigating.
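One rough way to investigate is to compare the read IDs in the two files. The sketch below assumes gzip-compressed FASTQ files and that the read ID is the first whitespace-delimited field of each header line, optionally followed by `/1` or `/2`. The function names `read_ids` and `check_pairs` are mine, for illustration only.

```shell
# Print the read IDs from a gzip-compressed FASTQ file,
# dropping any trailing /1 or /2 mate suffix.
read_ids() {
    gzip -cd "$1" | awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }'
}

# Print "paired" and return 0 only if R1 and R2 list the same
# read IDs in the same order; otherwise print "mismatched" and return 1.
check_pairs() {
    ids1=$(mktemp) && ids2=$(mktemp) || return 2
    read_ids "$1" > "$ids1"
    read_ids "$2" > "$ids2"
    if cmp -s "$ids1" "$ids2"; then
        echo "paired"; status=0
    else
        echo "mismatched"; status=1
    fi
    rm -f "$ids1" "$ids2"
    return "$status"
}

# Hypothetical file names; substitute your own:
# check_pairs Sample1_R1.fastq.gz Sample1_R2.fastq.gz
```

A "mismatched" result would suggest the forward and reverse files were subsampled independently, which is the scenario I'm worried about.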

Once you’re satisfied that this question is resolved, it might be worth opening a new topic in the General Discussion category to talk through your objectives in subsampling, and how best to go about achieving your goals.

Hope this helps,
Chris :dog:

Hi @ChrisKeefe,

Thank you for the welcome and replying to my question here. Much appreciated! =D

Yes, your reading of my data is correct.

Thank you for pointing this out. I did not realize the number of reads retained after DADA2 was actually that low. :astonished:
From my understanding, BBDuk does select matched reads, but I will have to check again.

Thanks again for the help! :smiley:


Happy to help, @afinaa! Let us know what you find out, OK? I’m curious. :slight_smile:

I checked what was done and found out that BBDuk didn't actually remove my adapter (it was probably something that we did wrong, not the tool). Therefore, I used QIIME 2 for my primer and adapter removal as follows. Although I checked manually and it seems there are no primers attached to my sequences, I still ran this step to make sure nothing is left.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences 2-Sequence_QC/rawdata.qza \
  --p-front-f 'my-primer' \
  --p-front-r 'my-primer' \
  --o-trimmed-sequences 2-Sequence_QC/trimmed.qza

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 2-Sequence_QC/trimmed.qza \
  --p-trim-left-f 17 --p-trim-left-r 21 --p-trunc-len-f 240 --p-trunc-len-r 240 \
  --o-representative-sequences 2-Sequence_QC/dada2-rep-seqs.qza \
  --o-table 2-Sequence_QC/dada2-table.qza \
  --o-denoising-stats 2-Sequence_QC/dada2stats.qza

dada2stats.qzv (1.2 MB)

Now around 40% of my reads are non-chimeric, which I suppose is good, although I think it should be higher. Around 80% of my input passed the filter. Does that mean there are still primers/adapters in it? :thinking:

While researching this, I also found that there is a subsample-paired action that randomly subsamples sequences based on a fraction. Is there any other way to subsample to a fixed number of reads, or to set a maximum number of reads to be processed? The filter-samples option only applies after denoising, is this correct?

Thank you.

Glad to hear you resolved the trimming issue, @afinaa. A 6-10x improvement in your feature counts is a great start, but as you suggest, it’s worth seeing whether you can get more data out of DADA2.

DADA2 filters based on quality scores. You may be able to fix this by setting better trim/trunc parameters. This is covered in depth in other forum posts. A little reading may help you optimize your choices.
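One check that is easy to do by hand when adjusting these parameters is whether the truncated reads still overlap enough to be merged. The sketch below is just arithmetic under my own assumptions: each read keeps (trunc-len minus trim-left) bases, and the region length you pass in (400 in the comment below) is a made-up value, not something from your data. To my knowledge DADA2's default minimum overlap for merging is 12 nt, but please verify that for your version.

```shell
# Estimate the overlap left for merging after trimming and truncation.
# Arguments (all integers):
#   $1 trunc-len-f   $2 trim-left-f
#   $3 trunc-len-r   $4 trim-left-r
#   $5 length of the region spanned between the trimmed read starts
#      (hypothetical; depends on your primers and target region)
expected_overlap() {
    echo $(( ($1 - $2) + ($3 - $4) - $5 ))
}

# Example with the parameters from the command above and an assumed
# 400 nt region:
# expected_overlap 240 17 240 21 400
```

If the result comes out close to or below the minimum overlap, shortening the truncation lengths further would likely cost you reads at the merging step rather than the filtering step.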

Before we talk about subsampling tools, why are you interested in subsampling your data? What are you trying to accomplish?


Thank you Chris for the reply. I am reading other forum posts for better understanding. :open_book: :open_book:

The reason is that our sequencing run does not generate consistent output. For example, as you can see from my samples, Sample6 and Sample7 have much higher output than the rest. We also have 100+ other samples, so we would like a fixed number of reads to be analysed for each sample.

Thanks for the explanation, @afinaa! I suspect you're working harder than you have to. QIIME 2's usual approach to sequencing-depth normalization is to let plugins take care of their own normalization, rather than normalizing everything upfront. For example, if you were to generate some basic biological diversity data with qiime diversity core-metrics-phylogenetic, the command would ask you for a --p-sampling-depth INTEGER which it uses to randomly subsample your sample data down to a given level. Other tools may handle normalization in their own ways, to meet their own needs.
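As a rough illustration, the command can be wrapped like this. The wrapper name `run_core_metrics` is mine, and the tree, metadata, and depth values in the usage comment are placeholders you would replace with your own artifacts and a depth chosen from your feature table summary.

```shell
# Compute core diversity metrics, rarefying every sample to a fixed depth.
# Arguments: feature table, rooted tree, sample metadata,
#            sampling depth, output directory.
run_core_metrics() {
    qiime diversity core-metrics-phylogenetic \
        --i-table "$1" \
        --i-phylogeny "$2" \
        --m-metadata-file "$3" \
        --p-sampling-depth "$4" \
        --output-dir "$5"
}

# Hypothetical usage (only the table path comes from this thread):
# run_core_metrics 2-Sequence_QC/dada2-table.qza rooted-tree.qza \
#     sample-metadata.tsv 10000 core-metrics-results
```

Keep in mind that samples with fewer reads than the sampling depth are dropped from the results, so the depth is a trade-off between keeping reads and keeping samples.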

You might save yourself some time and trouble reviewing the tutorials in the QIIME 2 docs. The Parkinson's Mouse tutorial might be a good place to start.

Happy :qiime2:-ing!

Hi Chris, thank you for the suggestion. You are a great help! :grin:
