Multiple runs merged after DADA2: overestimation of diversity?


I have 445 samples sequenced across 5 different Illumina MiSeq runs. Read quality differed between runs, so I decided, somewhat arbitrarily, to truncate both forward and reverse reads at position 230 to get the same read length across all 5 runs. (I have uploaded the .qzv files for each run after import.)

paired-end-demux-run1.qzv (286.8 KB)
paired-end-demux-run2.qzv (290.4 KB)
paired-end-demux-run3.qzv (288.5 KB)
paired-end-demux-run4.qzv (292.5 KB)
paired-end-demux-run5.qzv (287.0 KB)

Command used for each run (QIIME 2 version 2018.8):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 230 \
  --p-trunc-len-r 230 \
  --output-dir DADA2

After this step, the following numbers of features (ASVs) were "called" for each run:

run1: 393
run2: 1,365
run3: 180
run4: 997
run5: 451

After merging the results (using qiime feature-table merge) I get 2,289 ASVs.

I have the "feeling" that this overestimates the real number of ASVs.

My questions are:

- Has anyone compared the results of merging runs vs. sequencing the same samples in a single run?

- How do you "filter" the ASV table? Do you remove singletons and doubletons? Do you remove low-abundance ASVs? If so, using which threshold?

Thanks a lot for any feedback!




Hi @mbcarbonetto,

Why do you feel this is an overestimation? 2,289 is actually on the lower end of the ASV numbers I typically see around here. I’m actually wondering the opposite: whether this is an underestimation of your community. How did the DADA2 stats results look? Also, what is the target community, and what diversity do you expect in these samples? Keep in mind that ASVs can differ from each other by a single nt, so many of them may in fact collapse into the same species (or OTUs, if you are comparing them to previous methods).

By ‘run’, do you mean separate sequencing runs, or did you combine your files and then run DADA2 on everything together? The latter is only appropriate if all the samples were prepared together (i.e. same PCR and same sequencing run). Since your samples come from 5 different runs, the method you’ve chosen is the only correct way to do this: process each run separately with exactly the same truncation/trim parameters, then merge. :+1:

The choice of filtering is completely up to you, and it should be based on what you plan to do with the data. For example, if alpha diversity measures are important for your analysis, then discarding low-abundance ASVs is not advisable, because it throws away meaningful information and can distort the results. In contrast, differential abundance tests like gneiss and ANCOM work better when low-frequency and low-abundance ASVs are discarded, since these provide no real information to the tests and only add ‘noise’ to the analyses. You will likely use a combination of different filtering parameters depending on which tests you want to perform. There is a full tutorial here on how filtering can be done in QIIME 2.
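To make the filtering trade-off concrete, here is a minimal sketch in plain Python of the two thresholds people usually combine: a minimum total count per ASV and a minimum number of samples it appears in. The table, the ASV names and the `filter_features` helper are all made up for illustration, and the threshold values are arbitrary. In QIIME 2 itself the same idea is handled by `qiime feature-table filter-features`.

```python
def filter_features(table, min_total=2, min_samples=1):
    """Keep features that pass both thresholds.

    table: {feature_id: {sample_id: count}} (hypothetical toy layout)
    min_total: minimum summed count across all samples
    min_samples: minimum number of samples with a non-zero count
    """
    kept = {}
    for feature_id, counts in table.items():
        total = sum(counts.values())
        n_samples = sum(1 for c in counts.values() if c > 0)
        if total >= min_total and n_samples >= min_samples:
            kept[feature_id] = counts
    return kept


# Toy table: values are invented, not taken from the thread.
toy_table = {
    "asv_a": {"s1": 120, "s2": 98, "s3": 143},
    "asv_b": {"s1": 1, "s2": 0, "s3": 0},  # singleton
    "asv_c": {"s1": 0, "s2": 2, "s3": 0},  # doubleton
    "asv_d": {"s1": 7, "s2": 0, "s3": 5},
}

# Lenient: keep everything observed at least once (e.g. for alpha diversity).
lenient = filter_features(toy_table, min_total=1)

# Stricter: drop rare features before a differential abundance test.
strict = filter_features(toy_table, min_total=10, min_samples=2)

print(sorted(lenient))
print(sorted(strict))
```

The point of keeping two separate filtered tables is that the alpha diversity analysis and the differential abundance analysis can each start from the version that suits them.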


Hi Mehrbod_Estaki,

Thanks for the reply and information.

My target community is mouse gut after antibiotic treatment, so the low number of ASVs is not a concern; it is expected.
I feel there is an overestimation because I was expecting a number close to a mean value of the number of total reads when merging runs after DADA2. At least it is not the sum of the features.

Yes, I meant MiSeq sequencing runs.

Thanks again!


Hi @mbcarbonetto,

Well, technically you are not getting a sum of the features from your runs; if that were the case you would have gotten 3,386 ASVs and not 2,289. So I think everything is OK at this point. There are a few other things to note. I should have mentioned this earlier, but what is the target region of your amplicons? I just want to make sure your truncation at 230 leaves enough overlap for proper merging of the paired reads. Did your DADA2 stats summary look OK?
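The arithmetic behind this can be written out as a small sketch. Merging feature tables amounts to taking the union of the ASV sets from each run, so the merged count has to fall between the largest single run and the sum of all runs. The per-run counts below are the ones reported in the thread; the toy ASV sets and the `merged_feature_count` helper are invented purely to show the union behaviour.

```python
# Per-run ASV counts and merged total as reported in this thread.
run_counts = {"run1": 393, "run2": 1365, "run3": 180, "run4": 997, "run5": 451}
merged_count = 2289

upper_bound = sum(run_counts.values())  # no ASV shared between any runs
lower_bound = max(run_counts.values())  # every ASV already present in the largest run
redundant = upper_bound - merged_count  # per-run calls that duplicate an ASV from another run

print(lower_bound, merged_count, upper_bound, redundant)


def merged_feature_count(runs):
    """Number of distinct features after merging: the size of the set union."""
    return len(set().union(*runs.values()))


# Toy example with invented ASV identifiers.
toy_runs = {
    "runA": {"asv1", "asv2", "asv3"},
    "runB": {"asv2", "asv3", "asv4"},
    "runC": {"asv3", "asv5"},
}
toy_merged = merged_feature_count(toy_runs)
print(toy_merged)
```

With the numbers from this thread, 1,097 of the 3,386 per-run calls were ASVs already seen in another run, which is what you would expect when the runs sample similar communities.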
If you think the diversity is still higher than you expect: with mouse gut tissue samples (not fecal) I often find a bunch of host reads, which I filter out after DADA2. I find these with V3-V4 amplicons, but it might be the case with other regions too, since DNA extraction of intestinal bacteria often involves some thorough tissue disruption that can also introduce host DNA.
You can try a positive filter approach (something Deblur also does in the background). I’ve had good results using the 88_otus Greengenes database with qiime quality-control exclude-seqs at 65% identity and 60% alignment, with vsearch as the search method. It is very fast and does a great job of getting rid of host contaminants or any other artifacts that may have been missed in chimera removal. Worth a try!
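As a rough illustration of what the positive filter decides, the sketch below applies the 65% identity and 60% alignment thresholds to a set of best-hit records. It is not how exclude-seqs is implemented: the hit records, ASV names and the `split_hits` helper are hypothetical, and the actual search against the reference is done by vsearch. Sequences that pass both thresholds are treated as "hits" (kept), and everything else, including sequences with no hit at all, ends up in "misses" (discarded as likely host or artifact).

```python
def split_hits(best_hits, all_ids, min_identity=0.65, min_query_aligned=0.60):
    """Split sequence IDs into hits (kept) and misses (discarded).

    best_hits: {seq_id: (identity, fraction_of_query_aligned)} for sequences
               that matched the reference at all (hypothetical toy layout)
    all_ids:   every sequence ID in the query set
    """
    hits, misses = [], []
    for seq_id in all_ids:
        hit = best_hits.get(seq_id)
        if (
            hit is not None
            and hit[0] >= min_identity
            and hit[1] >= min_query_aligned
        ):
            hits.append(seq_id)
        else:
            misses.append(seq_id)
    return hits, misses


# Invented best-hit results against a 16S reference.
toy_hits = {
    "asv_a": (0.97, 1.00),  # clearly bacterial 16S
    "asv_b": (0.71, 0.88),  # divergent, but still passes both thresholds
    "asv_c": (0.66, 0.40),  # only a short stretch aligns
}
toy_ids = ["asv_a", "asv_b", "asv_c", "asv_host"]  # asv_host has no hit at all

hits, misses = split_hits(toy_hits, toy_ids)
print(hits)
print(misses)
```

The thresholds are deliberately loose: the goal is only to ask "does this look like a 16S sequence at all?", not to assign taxonomy.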
You may also be interested in reading up on batch effects and batch correction methods for 16S data. There are tons of discussions and recommendations on that topic on the forum.
Let us know how it goes!
