Welcome to the forum, @afinaa!
I haven’t used bbduk, but I’m going to give your question a shot anyway. Maybe we can learn something together.
If I’m reading your image and question correctly, you are using bbduk to subsample your sequences to an even depth, preserving 100k reads of length n from each file of sequences. Based on the file names, this looks like paired-end Illumina data. Does that sound right?
If so, that means you initially imported 20 files into QIIME 2, where 10 contained 100k forward reads (R1), and 10 contained 100k reverse reads (R2). When you ran dada2 denoise-paired, it joined the forward and reverse reads, so that each of those 100k forward reads was combined with one of the 100k reverse reads, leaving 100k joined reads per sample.
The big-picture questions I have for you are mostly about how BBDuk does its subsampling, and what it’s doing for you that isn’t possible with the commands you’re running in QIIME 2. To be clear, I am not suggesting you shouldn’t use BBDuk, or that QIIME 2 is everything to everyone. Rather, I’m wondering whether you’re duplicating your efforts unnecessarily, and possibly opening yourself up to errors in analysis along the way.
For example, the fact that you’re only retaining 6-9% of your reads after DADA2 raises red flags for me, and I’m wondering whether DADA2 is getting forward/reverse read pairs that don’t actually match. Is BBDuk selecting matched reads from your forward and reverse samples, or is it treating each file as an independent sample? I have no idea whether this is happening and causing data loss, but I think it’s worth investigating.
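One quick way to investigate this outside of any particular tool is to compare read IDs between an R1 file and its matching R2 file: if the subsampler treated each file independently, the IDs won't line up. Here's a minimal Python sketch of that check; the record contents and IDs below are invented for illustration, not taken from your data.

```python
def read_ids(fastq_lines):
    """Yield the read ID from every FASTQ record (4 lines per record),
    dropping any /1 or /2 suffix and header comments."""
    for i in range(0, len(fastq_lines), 4):
        header = fastq_lines[i]
        yield header[1:].split()[0].split("/")[0]

def pairs_match(r1_lines, r2_lines):
    """True if both files contain the same read IDs in the same order."""
    return list(read_ids(r1_lines)) == list(read_ids(r2_lines))

# Tiny in-memory stand-ins for a hypothetical Sample1_R1 / Sample1_R2 pair:
r1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "TTGA", "+", "IIII"]
r2 = ["@read1/2", "TGCA", "+", "IIII", "@read2/2", "TCAA", "+", "IIII"]
print(pairs_match(r1, r2))   # True  -> pairing preserved

mismatched = ["@read9/2", "TGCA", "+", "IIII", "@read2/2", "TCAA", "+", "IIII"]
print(pairs_match(r1, mismatched))  # False -> independent subsampling suspected
```

If this check fails on your real files, that would explain why DADA2 is discarding so many reads: it can't merge forward and reverse reads that come from different fragments.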
Once you’re satisfied that this question is resolved, it might be worth opening a new topic in the General Discussion category to talk through your objectives in subsampling, and how best to go about achieving your goals.
I checked what was done and found out that bbduk didn't actually remove my adapter (it was probably something that we did wrong, not the tool). Therefore, I used QIIME 2 for my primer and adapter removal as follows. Although I manually checked and it seems there are no primers attached to my sequences, I still did the step to ensure nothing is left.
Now the non-chimeric reads are around 40%, which I suppose is good, but I think it should be higher. Also, around 80% of my input passed the filter; does that mean there are still primers/adapters in it?
While researching this, I also found that there is a subsample-paired action that randomly subsamples the sequences based on a fraction. Is there any other way to subsample based on a fixed number of reads, or to set a maximum number of reads to be processed? The filter-samples option is only available after the denoising step, is this correct?
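For what it's worth, drawing a fixed number of reads (rather than a fraction) is usually done with reservoir sampling, which keeps exactly k records from a stream of unknown length, each with equal probability. This is a generic sketch of the idea, not a QIIME 2 feature; the numbers are arbitrary.

```python
import random

def reservoir_sample(records, k, seed=42):
    """Keep exactly k records (or all of them, if fewer arrive),
    each with equal probability, in a single pass over the stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # replace with decreasing probability
            if j < k:
                reservoir[j] = rec
    return reservoir

# e.g. keep 100k reads from a sample of 1M (here just integers as stand-ins):
sampled = reservoir_sample(range(1_000_000), k=100_000)
print(len(sampled))  # 100000
```

For paired-end data you would apply the same random choices to both files (e.g. sample record indices once, then pull those indices from R1 and R2) so that pairing is preserved.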
(Matthew Ryan Dillon)
Glad to hear you resolved the trimming issue, @afinaa. A 6-10x improvement in your feature counts is a great start, but as you suggest, it’s worth seeing whether you can get more data out of DADA2.
DADA2 filters based on quality scores. You may be able to fix this by setting better trim/trunc parameters. This is covered in depth in other forum posts. A little reading may help you optimize your choices.
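For intuition, here is a toy sketch of how one might read a truncation length off a per-position quality summary: truncate where the median quality first drops below some threshold. The function, threshold, and quality values below are all invented for illustration; your actual choice should come from inspecting your own quality plots.

```python
def pick_trunc_len(median_quality_by_position, min_quality=25):
    """Return the length to keep: the first position where median quality
    drops below min_quality (everything after it gets truncated)."""
    for pos, q in enumerate(median_quality_by_position):
        if q < min_quality:
            return pos  # keep bases [0, pos)
    return len(median_quality_by_position)  # quality never drops; keep all

# Invented median qualities resembling a typical Illumina tail drop-off:
medians = [38] * 200 + [30] * 30 + [22] * 20
print(pick_trunc_len(medians))  # 230
```

The trade-off to keep in mind: truncating too aggressively can leave paired reads too short to overlap and merge, while truncating too little lets low-quality tails fail DADA2's quality filter.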
Before we talk about subsampling tools, why are you interested in subsampling your data? What are you trying to accomplish?
Thank you Chris for the reply. I am reading other forum posts for better understanding.
The reason for this is that our sequencing run does not generate a consistent output. For example, as you can see from my samples, Sample6 and Sample7 have much higher output than the rest. We actually have 100+ other samples as well, so we wish to have a fixed number of reads analysed for each sample.
Thanks for the explanation, @afinaa! I suspect you're working harder than you have to. QIIME 2's usual approach to sequencing-depth normalization is to let plugins take care of their own normalization, rather than normalizing everything upfront. For example, if you were to generate some basic biological diversity data with qiime diversity core-metrics-phylogenetic, the command would ask you for a --p-sampling-depth INTEGER which it uses to randomly subsample your sample data down to a given level. Other tools may handle normalization in their own ways, to meet their own needs.
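Under the hood, that kind of even-depth subsampling (rarefaction) amounts to drawing a fixed number of observations from each sample's feature counts without replacement. A minimal Python sketch of the idea, with made-up feature IDs and counts (this is an illustration of the concept, not QIIME 2's implementation):

```python
import random

def rarefy(counts, depth, seed=0):
    """Randomly subsample a feature-count vector down to `depth` total
    observations, without replacement."""
    rng = random.Random(seed)
    # Expand counts into a pool of individual observations, one per read.
    pool = [fid for fid, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sampling depth exceeds sample total")
    rarefied = {}
    for fid in rng.sample(pool, depth):
        rarefied[fid] = rarefied.get(fid, 0) + 1
    return rarefied

sample = {"ASV_1": 5000, "ASV_2": 3000, "ASV_3": 2000}
out = rarefy(sample, depth=1000)
print(sum(out.values()))  # 1000
```

Note that samples with fewer total reads than the chosen depth are dropped entirely, which is why picking a sampling depth is a trade-off between per-sample depth and the number of samples retained.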