High number of chimeras and low prevalence of ASVs

Claire010 · September 13, 2020, 4:14am

Dear all,

I have a demultiplexed 16S V4 dataset (250PE) for 1621 stool samples (~700 unique subjects sampled longitudinally at multiple time points). Altogether, I obtained 261,819,393 raw reads. Below is the FastQC plot.

The quality of the reverse reads drops sharply after 150 bp.

I processed the data with DADA2 using the following settings:
truncation Q score = 2, maximum EE = 1, reverse reads truncated at 150 bp. The FastQC plot shows fairly good quality after this truncation.

[Image: FastQC quality plot]
After merging the paired-end reads, we obtained 408,017 ASVs. However, 87.7% of these ASVs were flagged as chimeras, accounting for 28.7% of the merged reads (52.6% of raw reads), which leaves 50,001 ASVs.
Of the remaining ASVs, 94% have a prevalence below 1%, and the other 6% have a prevalence of 2-10%.
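
For readers checking similar numbers on their own data: prevalence is taken here to mean the fraction of samples in which an ASV has at least one read. Below is a minimal Python sketch under that assumption. The toy table, sample names, and ASV names are hypothetical and are not taken from this dataset.

```python
from collections import defaultdict


def asv_prevalence(table):
    """Return {asv_id: fraction of samples with at least one read}.

    `table` maps sample_id -> {asv_id: read_count}.
    """
    n_samples = len(table)
    present_in = defaultdict(int)
    for counts in table.values():
        for asv_id, n_reads in counts.items():
            if n_reads > 0:
                present_in[asv_id] += 1
    return {asv_id: n / n_samples for asv_id, n in present_in.items()}


# Hypothetical toy table: 4 samples, 3 ASVs.
toy_table = {
    "S1": {"asv_a": 120, "asv_b": 3},
    "S2": {"asv_a": 98},
    "S3": {"asv_a": 101, "asv_c": 1},
    "S4": {"asv_a": 77, "asv_b": 0},
}

prevalence = asv_prevalence(toy_table)
# ASVs seen in fewer than half of the samples.
rare = sorted(a for a, p in prevalence.items() if p < 0.5)
print(prevalence, rare)
```

With 1621 samples, a prevalence below 1% corresponds to an ASV being present in 16 samples or fewer.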

For your reference, below is the number of reads retained at each step.

Here I have 3 questions.

1. Is the number of ASVs reasonable for such a big dataset? (It is a huge number compared with OTU delineation.)
2. Is it reasonable for more than 87% of ASVs to be classified as chimeras? Why are so many ASVs regarded as chimeras?
3. The prevalence is very low, which seems quite unusual for gut microbiome samples. Is this expected?

BTW, --p-min-fold-parent-over-abundance is set to 2.

May I have your comments and suggestions on this output?

Best regards,
Claire

Claire010 · September 15, 2020, 9:04am

@thermokarst may I have your comments on this "strange" output?

llenzi · September 15, 2020, 9:43am

Hi @Claire010,

Would you be able to share the demultiplexed.qzv (after cutadapt, if applicable) and the DADA2 statistics qzv?

The reason is that it is generally better to look at the quality profiles the way QIIME 2 presents them. FastQC is good, but it tends to bin the quality scores of many positions towards the tails of the reads, where the QIIME 2 viewer may give more resolution.

On the statistics you are showing, I am not sure I would interpret them the way you do, but I'd like to see the QIIME 2 artifact to be sure. It usually reports the sequences passing each filter, so my interpretation would be that you have 71.3% of reads left after merging and chimera removal. That is not very high, but not unreasonably low. (But really, the table you are showing is confusing to me, sorry ...)

Did you trim the reads before importing them into QIIME 2, or did you specify the trimming length within the dada2 command? If so, which trimming length did you set for the forward reads?

Just one more question about the prevalence you point out: your number of samples is quite high. Were they all processed at the same time and run on the same sequencing lane?
If, as I suspect, they were split across several batches, you should denoise each batch separately and then merge the results. Could you confirm this? EDIT: please note that you should use the same denoising parameters for all batches!

Let's wait for @thermokarst's point of view when he can, but I am sure the answers to the above will be very useful to him too!

Hope it helps

Claire010 · September 29, 2020, 11:42am

@llenzi Thank you so much for your comments and suggestions.

> On the statistics you are showing, I am not sure I would interpret them the way you do, but I'd like to see the QIIME 2 artifact to be sure. It usually reports the sequences passing each filter, so my interpretation would be that you have 71.3% of reads left after merging and chimera removal. That is not very high, but not unreasonably low. (But really, the table you are showing is confusing to me, sorry …)
@llenzi Sorry for the confusing table. You're right, 71.3% of reads are left after merging and chimera removal.

> Did you trim the reads before importing them into QIIME 2, or did you specify the trimming length within the dada2 command? If so, which trimming length did you set for the forward reads?
@llenzi I didn't trim before QIIME 2. I used the trim argument within DADA2. No trimming length was set for the forward reads.

> Just one more question about the prevalence you point out: your number of samples is quite high. Were they all processed at the same time and run on the same sequencing lane?
> If, as I suspect, they were split across several batches, you should denoise each batch separately and then merge the results. Could you confirm this? EDIT: please note that you should use the same denoising parameters for all batches!
@llenzi These samples were sequenced in 2 runs, 4 lanes. I denoised them all together. If I denoise each batch separately, how do I combine the ASV tables? If I directly combine the 4 ASV tables from the 4 batches, could some ASVs with different names actually be the same ASV? Is there a tutorial or guideline for processing multiple batches with QIIME 2?

llenzi · September 29, 2020, 12:45pm

Hi @Claire010,

Yes, you should denoise the data per lane, keeping the same trimming lengths in DADA2. As long as you keep the same trimming settings, ASVs from different lanes with the same sequence will be recognised as the same feature (the ASV name is in fact a hash of the sequence, hence same name == same sequence!).
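
To illustrate why identical sequences from different lanes end up with identical names, here is a minimal Python sketch. It assumes the feature ID is the MD5 hex digest of the sequence string, which is my understanding of the q2-dada2 default; the sequences below are made up for illustration.

```python
import hashlib


def feature_id(sequence):
    """Hash a sequence into a feature ID (assumption: MD5 hex digest)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Hypothetical ASV sequence observed independently in two lanes.
lane1_seq = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGG"
lane2_seq = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGG"

# The same biological sequence, but truncated 4 bp shorter in lane 2.
lane2_seq_truncated = lane2_seq[:-4]

same_id = feature_id(lane1_seq) == feature_id(lane2_seq)
different_id = feature_id(lane1_seq) != feature_id(lane2_seq_truncated)
print(same_id, different_id)
```

The second comparison shows why the trimming settings have to match across lanes: a sequence that differs only in length hashes to a different ID and would be counted as a separate feature after merging.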

For merging afterwards, please look at the 'feature-table merge' action (docs: "merge: Combine multiple tables", QIIME 2 2020.8.0 documentation) and the 'feature-table merge-seqs' action (docs: "merge-seqs: Combine collections of feature sequences", QIIME 2 2020.8.0 documentation).
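
Conceptually, merging per-lane tables amounts to taking the union of samples and features. The Python sketch below is only an illustration of that idea, not the QIIME 2 implementation. It assumes each sample was sequenced on exactly one lane, and the helper name merge_tables and the feature IDs are hypothetical.

```python
def merge_tables(*tables):
    """Combine per-lane tables of the form {sample_id: {feature_id: count}}.

    Assumption: every sample appears in exactly one lane, so a repeated
    sample ID is treated as an error rather than summed.
    """
    merged = {}
    for table in tables:
        for sample_id, counts in table.items():
            if sample_id in merged:
                raise ValueError(
                    f"sample {sample_id!r} appears in more than one table"
                )
            merged[sample_id] = dict(counts)
    return merged


# Hypothetical per-lane tables; "id_x" was found in both lanes.
lane1 = {"S1": {"id_x": 10, "id_y": 2}}
lane2 = {"S2": {"id_x": 7, "id_z": 1}}

merged = merge_tables(lane1, lane2)
features = sorted({f for counts in merged.values() for f in counts})
print(features)
```

Because the feature IDs are derived from the sequences, "id_x" from lane 1 and "id_x" from lane 2 line up as a single feature in the merged table.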
Hope it helps

Claire010 · September 30, 2020, 2:29am

Hi @llenzi,

Really appreciate your suggestions and help. I will have a look at it.

Just wondering, will denoising the 4 lanes together result in a high number of chimeras? How does DADA2 decide that an ASV is a chimera? By mapping to a 16S database?

llenzi · September 30, 2020, 9:10am

Hi @Claire010

In my understanding, DADA2 identifies chimeric sequences de novo, testing whether a sequence can be explained as a combination of more than one other sequence within the same dataset (but please do not take this as a very precise definition).
In your case, it will identify chimeric sequences within each lane. The point of denoising the lanes separately is that each lane may introduce its own bias into the data. To compensate for this, you can either run the whole sample pool on all 4 lanes (so that all biases apply equally to all samples) or denoise each lane separately, so that DADA2 can learn the noise of that single run without being confused by a mixture of different noise profiles.

In theory, you should denoise as a separate group any pool of samples that was processed together from the very beginning: same extraction kit lot, same PCR reagents, and so on, up to the same sequencing lane!
As for your question on whether denoising together results in a higher number of chimeras, I can't really say. It will depend on lots of factors, I'd say!
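
To make the de novo idea above more concrete, here is a deliberately simplified Python sketch. It is not DADA2's actual algorithm: it assumes all sequences have the same length, uses exact matching only, and flags a sequence when its left part matches one more abundant sequence and its right part matches another. The function name is hypothetical, and the min_fold parameter only loosely mirrors the idea behind --p-min-fold-parent-over-abundance.

```python
def is_exact_bimera(candidate, candidate_abundance, pool, min_fold=1.0):
    """Simplified two-parent chimera check (illustrative assumption only).

    `pool` is a list of (sequence, abundance) pairs from the same dataset.
    Only equal-length sequences that are at least `min_fold` times more
    abundant than the candidate are considered as possible parents.
    """
    n = len(candidate)
    parents = [
        seq
        for seq, abundance in pool
        if seq != candidate
        and len(seq) == n
        and abundance >= min_fold * candidate_abundance
    ]
    best_left = 0   # longest prefix shared with any parent
    best_right = 0  # longest suffix shared with any parent
    for parent in parents:
        left = 0
        while left < n and candidate[left] == parent[left]:
            left += 1
        right = 0
        while right < n and candidate[n - 1 - right] == parent[n - 1 - right]:
            right += 1
        best_left = max(best_left, left)
        best_right = max(best_right, right)
    # Chimeric if a prefix from one parent and a suffix from another
    # together cover the whole candidate.
    return best_left + best_right >= n


# Hypothetical toy dataset: two abundant parents and two rare sequences.
pool = [
    ("AAAAAAAAAA", 1000),
    ("CCCCCCCCCC", 800),
    ("AAAAACCCCC", 20),  # left half of parent 1 + right half of parent 2
    ("AAAAAGGGGG", 20),  # right half matches no parent
]

flags = {seq: is_exact_bimera(seq, abund, pool, min_fold=2) for seq, abund in pool}
print(flags)
```

Note that nothing in this check consults a reference database. Candidates are compared only against other, more abundant sequences from the same denoising run, which is why the composition of that run matters.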

Cheers
