I detected as many as 121,945 features from 212 samples, which shocked me. Many of them are present in only one sample, but they are not singletons.
I am certain that the adapters and primers have been removed from the paired reads. Is such a huge number of ASVs normal?
I also tried merging the paired-end reads with FLASH before running DADA2, but this did not alleviate the issue. Although some features could be removed by filtering on frequency or prevalence, I am afraid this would affect the downstream analysis.
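To judge how much a frequency/prevalence filter would actually change the data, one option is to check what fraction of reads survives it. Below is a minimal sketch in plain Python, assuming the feature table has been exported to a simple mapping of feature ID to per-sample counts. The toy table, the function names, and the thresholds (`min_frequency=10`, `min_samples=2`) are all hypothetical and only for illustration. In QIIME 2 itself the comparable step would be `qiime feature-table filter-features` with `--p-min-frequency` and `--p-min-samples`.

```python
from typing import Dict

# Hypothetical toy feature table: feature ID -> {sample ID -> read count}.
FeatureTable = Dict[str, Dict[str, int]]


def filter_features(table: FeatureTable,
                    min_frequency: int = 10,
                    min_samples: int = 2) -> FeatureTable:
    """Keep features whose total count is >= min_frequency and that
    are observed (count > 0) in at least min_samples samples."""
    kept = {}
    for feature_id, counts in table.items():
        total = sum(counts.values())
        prevalence = sum(1 for c in counts.values() if c > 0)
        if total >= min_frequency and prevalence >= min_samples:
            kept[feature_id] = counts
    return kept


def fraction_reads_retained(original: FeatureTable,
                            filtered: FeatureTable) -> float:
    """Fraction of all reads that remain after filtering."""
    before = sum(sum(c.values()) for c in original.values())
    after = sum(sum(c.values()) for c in filtered.values())
    return after / before if before else 0.0


toy_table = {
    "asv_a": {"s1": 500, "s2": 450, "s3": 520},  # abundant, in every sample
    "asv_b": {"s1": 0, "s2": 30, "s3": 0},       # one sample only, not a singleton
    "asv_c": {"s1": 1, "s2": 0, "s3": 0},        # singleton
}

filtered = filter_features(toy_table, min_frequency=10, min_samples=2)
print("features kept:", sorted(filtered))
print("fraction of reads retained:",
      round(fraction_reads_retained(toy_table, filtered), 3))
```

If most reads are retained while most features are dropped, the filter is mainly removing rare features, and the impact on abundance-based downstream analyses is likely smaller than the drop in feature count suggests.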
Hi, @Cobaya417. I deeply appreciate your help. Indeed, the sequencing was performed on a NovaSeq™ 6000 System. I had never been aware of this issue, which I guess is caused by the quality-score problem.
BTW, I am concerned that the OTU method may be somewhat out of date, although many studies still adopt it. I have seen arguments that DADA2 outperforms OTU clustering. Is there any solid evidence that DADA2 is superior to OTUs?
If you would like to keep an ASV approach, you can denoise using Deblur, which does not take the quality scores into account but only the sequences. Please see denoising Option 2 in the “Moving Pictures” tutorial (QIIME 2 2023.2.0 documentation).
On the number of variants (121,945), it is difficult to assess the number of sequences without more information. What type of samples are they? Do you expect the samples to be similar, or should there be groups with distinct sequences? Do you have any other experiment or data that tells you your number is too large? With NovaSeq sequencing you are reaching a higher total coverage for each sample, hence you could have more low-abundance species in each sample, as well as higher counts for abundant species.
Hi, @llenzi. Thanks for the suggestion. My samples are human tissue, which should be a low-biomass environment. Thus, I think the number of detected ASVs should not be as large as ~120k, which is more than I have seen when processing human fecal samples. Since I had not realized that the sequencing platform could also affect the data analysis, I have no idea whether ~120k ASVs are a fact or an artifact. I am trying other methods to see whether the biological interpretations are similar.
Trying a different approach makes sense.
You can also look at the distribution of the ASVs, for example whether any are present in all the samples at low frequency, which may derive from contamination and/or spurious events. This is easier to judge if you have negative and positive controls in the dataset.
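As a rough illustration of that check, here is a minimal sketch in plain Python that flags features present in every sample but at low mean relative abundance. It assumes the table is available as a mapping of feature ID to per-sample counts. The toy data, the function name, and the cutoffs (`min_prevalence=1.0`, `max_mean_rel_abundance=0.01`) are hypothetical choices, not recommended values.

```python
from typing import Dict, List

# Hypothetical toy feature table: feature ID -> {sample ID -> read count}.
FeatureTable = Dict[str, Dict[str, int]]


def ubiquitous_low_abundance(table: FeatureTable,
                             sample_ids: List[str],
                             min_prevalence: float = 1.0,
                             max_mean_rel_abundance: float = 0.01) -> List[str]:
    """Return features seen in at least min_prevalence (fraction) of the
    samples whose mean relative abundance is <= max_mean_rel_abundance."""
    # Total reads per sample, used to convert counts to relative abundance.
    totals = {s: sum(counts.get(s, 0) for counts in table.values())
              for s in sample_ids}
    flagged = []
    for feature_id, counts in table.items():
        present = sum(1 for s in sample_ids if counts.get(s, 0) > 0)
        prevalence = present / len(sample_ids)
        rel = [counts.get(s, 0) / totals[s] for s in sample_ids if totals[s] > 0]
        mean_rel = sum(rel) / len(rel) if rel else 0.0
        if prevalence >= min_prevalence and mean_rel <= max_mean_rel_abundance:
            flagged.append(feature_id)
    return sorted(flagged)


toy_table = {
    "asv_x": {"s1": 900, "s2": 800, "s3": 950},  # everywhere, abundant
    "asv_y": {"s1": 2, "s2": 1, "s3": 3},        # everywhere, very low
    "asv_z": {"s1": 98, "s2": 0, "s3": 47},      # missing from one sample
}

candidates = ubiquitous_low_abundance(toy_table, ["s1", "s2", "s3"])
print("possible contaminant/spurious features:", candidates)
```

Features flagged this way are only candidates. Comparing them against what appears in the negative controls is what would tell contamination apart from genuinely widespread low-abundance taxa.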