too many features after DADA2

I have 16S rRNA data from tissue samples. The sequencing quality is acceptable, so I did not trim the paired-end reads when processing with DADA2.

The command for DADA2 is as follows.

qiime dada2 denoise-paired --p-n-threads 60 --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 0 --p-trunc-len-r 0 --o-table dada2_table.qza --o-representative-sequences dada2_rep_set.qza --o-denoising-stats dada2_stats.qza

I detected as many as 121,945 features across 212 samples, which shocked me. Many of them are present in only one sample, although they are not singletons.


I am certain that the adapters and primers have been removed from the paired reads. Is such a huge number of ASVs normal?

I also tried merging the paired-end reads with FLASH before running DADA2, but this did not alleviate the issue. Although some features could be removed by frequency or prevalence filtering, I am afraid this would affect the downstream analysis.

Any suggestions would be greatly appreciated!

I recommend an OTU approach. You seem to have NovaSeq/NextSeq data with binned quality scores, so you need to either correct the DADA2 error model manually (see "Binned quality scores and their effect on (non-decreasing) trans rates", Issue #1307 in benjjneb/dada2 on GitHub) or go for the good old clustering approach (OTU).

Kind regards,

PS: Regarding the huge number of ASVs (an effect of binned scores?), filter by incidence and abundance; you will probably end up removing more than 50% of them.
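
To make the incidence and abundance filter concrete, here is a minimal sketch on a toy table. The function name `filter_features`, the thresholds (at least 2 samples, at least 10 reads in total), and the toy data are illustrative assumptions, not recommended values. In QIIME 2 the same idea is available through `qiime feature-table filter-features` with `--p-min-samples` and `--p-min-frequency`.

```python
def filter_features(table, min_samples=2, min_frequency=10):
    """Keep features seen in at least min_samples samples and with a
    total count of at least min_frequency.

    table maps feature ID -> {sample ID: read count}.
    The default thresholds are arbitrary and only for illustration.
    """
    kept = {}
    for feature, counts in table.items():
        prevalence = sum(1 for c in counts.values() if c > 0)  # incidence
        total = sum(counts.values())                           # abundance
        if prevalence >= min_samples and total >= min_frequency:
            kept[feature] = counts
    return kept


# Toy data (hypothetical): asv2 occurs in one sample only, asv3 is very rare.
toy = {
    "asv1": {"s1": 50, "s2": 30, "s3": 20},
    "asv2": {"s1": 40},
    "asv3": {"s1": 3, "s2": 2},
}
print(sorted(filter_features(toy)))  # ['asv1']
```

Comparing the number of features before and after the filter for a few threshold choices shows how sensitive the table is to them.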

Hi @Cobaya417, I deeply appreciate your help. Indeed, the sequencing was performed on a NovaSeq™ 6000 System. I had never been aware of this issue, which I guess is caused by the quality-score problem.
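
Whether a run has binned quality scores can be checked directly from the reads by counting the distinct Phred values in a FASTQ file. The sketch below assumes Phred+33 encoding and unwrapped four-line FASTQ records; the function name `distinct_quality_scores` and the file name in the usage comment are hypothetical. Binned data typically show only a handful of distinct values, whereas unbinned data show many more.

```python
import gzip
from collections import Counter


def distinct_quality_scores(fastq_path, max_reads=10000):
    """Count Phred scores (assuming Phred+33) in the first max_reads records.

    Assumes each FASTQ record spans exactly four lines.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # fourth line of each record holds the qualities
                counts.update(ord(ch) - 33 for ch in line.rstrip("\n"))
    return counts


# Usage (hypothetical file name):
# counts = distinct_quality_scores("sample_R1.fastq.gz")
# print(sorted(counts))  # a short list of values suggests binned scores
```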

BTW, I am concerned that the OTU method may be somewhat out of date, although many studies still adopt it. I have seen it argued that DADA2 outperforms OTU clustering. Is there any solid evidence that DADA2 is superior to OTUs?

Hi @Wei_Zhang,
If you would like to keep an ASV approach, you can denoise using Deblur, which does not take the quality scores into account but only the sequences. Please look at denoising option 2 of the "Moving Pictures" tutorial in the QIIME 2 2023.2.0 documentation.
As for the number of variants (121,945), it is difficult to assess without more information. What type of samples are they? Do you expect the samples to be similar, or should there be groups with distinct sequences? Do you have any other experiment or data that tells you this number is too large? With NovaSeq sequencing you reach a higher total coverage for each sample, so you could detect more low-abundance species in each sample, as well as higher counts for abundant species.

Hope it helps

Hi @llenzi, thanks for the suggestion. My samples are human tissue, which should be a low-biomass environment. Thus, I think the number of detected ASVs should not be as large as ~120k, which is more than I have seen when processing human fecal samples. As I had never realized that the sequencing platform could also affect the data analysis, I have no idea whether the ~120k ASVs are real or an artifact. I am trying other methods to see whether the biological interpretations are similar.

Hi @Wei_Zhang,
Trying a different approach makes sense.
You can also look at the distribution of the ASVs, for example whether any are present in all samples at low frequency, which may derive from contamination and/or spurious events. If you have negative and positive controls in the dataset, check the ASVs against them as well.
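
As a rough illustration of that check, the sketch below summarizes each feature by prevalence, mean count where present, and presence in control samples. The function name `summarize_features`, the sample and control IDs, and the toy counts are all hypothetical.

```python
def summarize_features(table, sample_ids, control_ids=()):
    """Per-feature prevalence across sample_ids, mean count in the samples
    where the feature is present, and whether it occurs in any control.

    table maps feature ID -> {sample ID: read count}.
    """
    summary = {}
    for feature, counts in table.items():
        present = [s for s in sample_ids if counts.get(s, 0) > 0]
        total = sum(counts[s] for s in present)
        summary[feature] = {
            "prevalence": len(present) / len(sample_ids),
            "mean_count": total / len(present) if present else 0.0,
            "in_controls": any(counts.get(c, 0) > 0 for c in control_ids),
        }
    return summary


# Toy data (hypothetical): asvA is everywhere at low counts and in the
# negative control; asvB is abundant in a single tissue sample.
table = {
    "asvA": {"t1": 2, "t2": 1, "t3": 3, "neg1": 4},
    "asvB": {"t1": 500},
}
summary = summarize_features(table, ["t1", "t2", "t3"], control_ids=["neg1"])
for feature, stats in summary.items():
    print(feature, stats)
```

Features with prevalence close to 1, a low mean count, and presence in negative controls are the ones worth inspecting as possible contaminants.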
