High number of chimeras and low prevalence of ASVs

Dear all,

I have a demultiplexed 16S V4 dataset (250 bp PE) for 1621 stool samples (~700 unique subjects, sampled longitudinally at multiple time points). Altogether, I obtained 261,819,393 raw reads. Below is the fastqc plot.

  • The quality of the reverse reads drops sharply after 150 bp.

  • I processed the data with DADA2 using the following criteria:
    truncate Qscore = 2, maximum EE = 1, reverse reads truncated at 150 bp. The fastqc plot shows quite good quality after this.

  • After merging paired-end reads, we get 408,017 ASVs. However, 87.7% of these ASVs are flagged as chimeras, which accounts for 28.7% of merged reads (52.6% of raw reads). Removing them leaves 50,001 ASVs.

  • Among the remaining ASVs, 94% have <1% prevalence, and the other 6% have prevalence ranging from 2-10%.
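
For readers unfamiliar with the filtering criteria listed above, here is a minimal Python sketch of how a truncation quality score, a truncation length and a maximum expected error (EE) act on a single read. EE is the sum of the per-base error probabilities 10^(-Q/10). The function names, the order in which the steps are applied, and the example reads are assumptions made for illustration; this is not the DADA2 code.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities, 10 ** (-Q / 10), for Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def filter_read(quals, trunc_q=2, trunc_len=150, max_ee=1.0):
    """Return the truncated quality list, or None if the read is discarded.

    Hypothetical helper: the step order below is an assumption of this sketch.
    """
    # Truncate at the first base whose quality is <= trunc_q.
    for i, q in enumerate(quals):
        if q <= trunc_q:
            quals = quals[:i]
            break
    # Reads shorter than trunc_len are discarded; longer ones are cut to it.
    if len(quals) < trunc_len:
        return None
    quals = quals[:trunc_len]
    # Discard reads whose expected errors exceed max_ee.
    if expected_errors(quals) > max_ee:
        return None
    return quals


# Made-up reads: 250 bases at Q30 and 250 bases at Q20.
good = filter_read([30] * 250)  # 150 bases kept, EE = 150 * 0.001 = 0.15
bad = filter_read([20] * 250)   # EE = 150 * 0.01 = 1.5 > 1, so discarded
```

With maximum EE = 1, a 150 bp read at a uniform Q20 is already rejected, which shows how strict this setting is for reads with a noisy tail.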

For your reference, below is the number of reads retained at each step.

Here I have 3 questions.

  1. Is the number of ASVs reasonable for such a big dataset? (It is a huge number compared with OTU delineation.)

  2. Is it reasonable for more than 87% of ASVs to be classified as chimeras? Why are so many ASVs regarded as chimeras?

  3. The prevalence is very low, which is quite unusual for gut microbiome samples.
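
To be explicit about what "prevalence" means here: it is the fraction of samples in which an ASV has at least one read. With 1621 samples, 1% prevalence corresponds to roughly 16 samples. Below is a minimal Python sketch of that calculation; the table layout, the function name and the toy counts are made up for illustration and are not taken from the dataset.

```python
def prevalence(table):
    """Fraction of samples in which each ASV has a count > 0.

    `table` maps sample ID -> {ASV ID: read count} (hypothetical layout).
    """
    n_samples = len(table)
    present = {}
    for counts in table.values():
        for asv, n in counts.items():
            if n > 0:
                present[asv] = present.get(asv, 0) + 1
    return {asv: k / n_samples for asv, k in present.items()}


# Toy table with four samples (invented values).
toy = {
    "s1": {"asv_a": 10, "asv_b": 0},
    "s2": {"asv_a": 3, "asv_c": 1},
    "s3": {"asv_a": 7},
    "s4": {"asv_b": 2},
}
prev = prevalence(toy)
# ASVs seen in fewer than half of the toy samples.
rare = sorted(asv for asv, p in prev.items() if p < 0.5)
```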

BTW, “--p-min-fold-parent-over-abundance FLOAT” is set to 2.
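
As far as the option's description goes, this value is the minimum abundance a sequence must have, as a fold-change over the sequence being tested, to be considered one of its potential chimera parents. The Python sketch below illustrates only that candidate-parent filter, not the chimera test itself. The function name, the toy abundances and the use of a non-strict comparison at the boundary are assumptions of this sketch.

```python
def candidate_parents(abundances, query, min_fold=2.0):
    """IDs abundant enough to be considered potential parents of `query`.

    Hypothetical helper: keeps sequences whose abundance is at least
    min_fold times the abundance of the query sequence.
    """
    threshold = min_fold * abundances[query]
    return sorted(
        seq_id
        for seq_id, n in abundances.items()
        if seq_id != query and n >= threshold
    )


# Invented abundances for four sequences.
abund = {"A": 1000, "B": 400, "C": 150, "X": 100}
parents_fold_1 = candidate_parents(abund, "X", min_fold=1.0)  # A, B and C
parents_fold_2 = candidate_parents(abund, "X", min_fold=2.0)  # A and B only
```

A larger value leaves fewer candidate parents for each sequence, so raising it would be expected to make chimera calling more conservative.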

May I have your comments and suggestions on this output?

Best regards,

@thermokarst may I have your comments on this “strange” output?

Hi @Claire010 ,

Would you be able to share the demultiplexed.qzv (after cutadapt, if applicable) and the dada2 statistics qzv?

The reason is that it is generally better to look at the quality profiles the way qiime2 displays them. fastqc is good, but it tends to bin the quality scores over many positions at the tails, where the qiime2 viewer may give more resolution.

Regarding the statistics you are showing, I am not sure I would interpret them the way you do, but I’d like to see the qiime2 artefact to be sure. It usually reports the sequences passing each filter, so my interpretation would be that 71.3% of the merged reads are non-chimeric and are retained. That is not very high, but not unreasonably low. (But really, the table you are showing is confusing to me, sorry …)
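
To make the two possible denominators explicit, here is a minimal Python sketch that expresses the reads retained at each step both as a percentage of the input and as a percentage of the previous step. The step names follow the columns of the dada2 statistics table; the counts are invented so that the last step reproduces 100 - 28.7 = 71.3% of merged reads, and they are not the poster's numbers.

```python
def retention(steps):
    """Rows of (step, count, % of input, % of previous step).

    `steps` is an ordered list of (step name, read count) pairs.
    """
    first = steps[0][1]
    rows = []
    prev = first
    for name, count in steps:
        pct_input = round(100 * count / first, 2)
        pct_prev = round(100 * count / prev, 2)
        rows.append((name, count, pct_input, pct_prev))
        prev = count
    return rows


# Invented counts for one sample.
steps = [
    ("input", 2000),
    ("filtered", 1600),
    ("denoised", 1500),
    ("merged", 1000),
    ("non-chimeric", 713),
]
rows = retention(steps)
for name, count, pct_input, pct_prev in rows:
    print(f"{name:>12} {count:>6} {pct_input:>7}% {pct_prev:>7}%")
```

In this toy case the same final count reads as 71.3% of merged reads but only 35.65% of the input, which is why the denominator needs to be stated.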

Did you trim the reads before importing them into qiime2, or did you specify the trimming length within the dada2 command? If so, which trimming length did you set for the forward reads?

Just one more question on the prevalence you point out: your number of samples is quite high. Were they all processed at the same time and run on the same sequencing lane?
If, as I suspect, they were split over several batches, you should denoise each batch separately and then merge the results. Could you confirm this? EDIT: please note that you should use the same denoising parameters for all the batches!
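
The point about identical parameters can be illustrated with a small Python sketch. ASVs are identified by their exact sequence, so batches truncated to different lengths would produce different sequences for the same organism and would not line up after merging. The helper names and the toy sequences are made up for illustration; in practice the merge itself would be done with QIIME 2's own tools (for example `qiime feature-table merge`) rather than with code like this.

```python
def merge_tables(*tables):
    """Combine per-batch tables of the form sample ID -> {ASV sequence: count}.

    Hypothetical helper: sample IDs are assumed to be unique across batches.
    """
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"duplicate sample ID: {sample}")
            merged[sample] = dict(counts)
    return merged


def shared_asvs(table_a, table_b):
    """ASV sequences that occur in both tables."""
    seqs_a = {seq for counts in table_a.values() for seq in counts}
    seqs_b = {seq for counts in table_b.values() for seq in counts}
    return seqs_a & seqs_b


# Toy batches: the same organism, denoised with equal or different truncation.
batch1 = {"s1": {"ACGTACGT": 5}}
batch2_same_len = {"s2": {"ACGTACGT": 3}}
batch2_shorter = {"s2": {"ACGTAC": 3}}

same = shared_asvs(batch1, batch2_same_len)     # one shared ASV
different = shared_asvs(batch1, batch2_shorter)  # empty: sequences differ
merged = merge_tables(batch1, batch2_same_len)
```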

Let's wait for @thermokarst's point of view when he can, but I am sure the answers to the questions above will be very useful to him too!

Hope it helps