Many chimeric reads after dada2, but only in some samples

alfanon · May 25, 2018, 11:58am

Dear all,

I have run dada2 on my paired-end Illumina reads (2x300 bp), already demultiplexed by the sequencing center, with the following command:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-trunc-len-f 296 \
  --p-trunc-len-r 220 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza \
  --p-n-threads 8
Everything seems to have worked fine, but when visualizing the dada2 stats with 'qiime metadata tabulate' I realized that for many samples only a few chimeric sequences were identified and removed, while for some other samples almost half of the sequences were identified as chimeric. Do you have any idea why this happened? Why such big differences? Could it be a problem with the way I trimmed/truncated the reads in dada2? I have checked the fastqc traces of samples where many chimeric sequences were identified and of samples where only a few chimeras were identified, and in terms of quality they seem similar. I am attaching the demux summarize output and the stats-dada2.qzv.
Thanks a lot in advance for any advice on this.
demux2.qzv (298.0 KB)
stats-dada2.qzv (1.2 MB)

Cheers
Niccolò

Mehrbod_Estaki · May 25, 2018, 5:45pm

Hi @alfanon,

Thanks for providing your artifacts! They are very helpful.
My initial thought is that this may just reflect true biological content and is not something to really worry about. But let's take a closer look to be sure. A few additional questions first:

Has there been any other type of quality control done prior to dada2? Sometimes sequencing centers apply their own QC which might interfere with DADA2.
Your reads are in quite good shape and your trimming/truncating parameters look fine to me. What region do your primers target (16S, 18S, ...) and what is the expected overlap between the paired reads?
What is the sample type being looked at here?
Are all the samples from the same sequencing run or is it a combination of multiple runs merged?

alfanon · May 28, 2018, 1:16pm

Dear @Mehrbod_Estaki,

thanks for your answer. That's also my guess at the moment. The replies to your questions:

I have checked that, and the reads I used are the raw reads coming from the Illumina machine; no QC was done by the sequencing facility. It was a 2x300 bp run. I was a bit surprised that a small fraction of reads are shorter than 300 bp, but they assured me that they did not do any QC.
It's a 16S run with primers 341F-805R. The amplicon is expected to be 460 bp long, so with 2x300 bp reads I expect a 140 bp overlap, if I am not wrong. In any case, when I did my calculations for the trimming/truncation in dada2, I took into account the nt that would be trimmed and truncated at the ends, and I got a shorter "available" overlap for merging the reads.
The sample type is mosquito gut. Interestingly, the samples with many chimeras identified are water samples which were included in the sampling. Could that be the reason? Maybe, with less material to amplify, the PCR created more chimeras in these samples?
All samples come from a single sequencing run.

Thanks for helping
Cheers

Mehrbod_Estaki · May 28, 2018, 5:20pm

Hi @alfanon,

Thanks for the follow-up.

A couple of things come to mind, though none of them are particularly worrying in my opinion. First, the tail end of the reads (as is expected with Illumina) might just have been low enough in quality that the sequencer automatically excluded them; this wouldn't even need manual QC. I've also experienced this with host contamination of the reads. Once merged, you can blast a few of those shorter reads to see what they are hitting. If they are host, or some unknown target, we can filter them pretty easily if needed.

I think based on your trim/truncating parameters the minimum overlap required for DADA2 is preserved so that shouldn't be a problem. If you were worried about this though, you can always just use your forward reads which seem to be in great shape!
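As a rough check of that overlap, here is a minimal Python sketch of the arithmetic. It assumes the ~460 bp amplicon length includes the primers and that both reads start at the outer ends of the amplicon, so the --p-trim-left-* values remove bases from the outer ends and do not change the overlap. The function name merge_overlap is made up for illustration, and the 12 nt threshold is assumed to be the default minOverlap of DADA2's mergePairs.

```python
def merge_overlap(amplicon_len, trunc_len_f, trunc_len_r):
    """Bases shared by the truncated forward and reverse reads.

    Assumes amplicon_len is measured primer to primer (primers included)
    and that each read starts at one outer end of the amplicon.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


# Values from this thread: ~460 bp amplicon, 2x300 bp reads,
# --p-trunc-len-f 296 and --p-trunc-len-r 220.
untruncated = merge_overlap(460, 300, 300)
overlap = merge_overlap(460, 296, 220)

MIN_OVERLAP = 12  # assumption: default minOverlap of DADA2's mergePairs

print(f"overlap without truncation: {untruncated} nt")
print(f"overlap after truncation:   {overlap} nt (needs >= {MIN_OVERLAP})")
```

With the numbers from this thread that works out to roughly 56 nt after truncation, compared with 140 nt for the untruncated reads. Real amplicons vary in length, so the overlap will differ somewhat between taxa.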

Aha! I think your deduction is spot on here. If I had to guess, I'd say this is it. The probability of chimeras forming in a pool of nucleotides and primers is certainly higher when there is no real target. I also kind of recall this being especially true if a high-fidelity polymerase wasn't used for the PCR.
Overall, I don't think this is an issue with the DADA2 algorithm, but rather something from your preparation, and even then I wouldn't worry about it. Just carry on with your analyses, taking care to filter out those spurious reads. I'd be interested in what @Nicholas_Bokulich's thoughts are on this though.

Nicholas_Bokulich · May 28, 2018, 5:55pm

I agree. I think it is unfortunate that so many reads are being identified as chimeras, but this is almost certainly related to prep protocol and sample characteristics, not dada2, and you should just carry on with your analyses (dada2 will remove those chimeras automatically).

I like your thoughts @alfanon @Mehrbod_Estaki that low biomass in the water samples could have led to higher chimera counts in those samples.

If in doubt, you can always get a "second opinion" and compare your results to processing with an alternative method...

but I would point out that your data look beautiful in other ways. Very nice, high read counts. In your position, I would just move on with the analysis and not worry about the high chimera levels in some samples.
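For anyone who wants to put a number on the chimera levels discussed in this thread, here is a minimal Python sketch that computes the chimeric fraction per sample from the denoising stats exported as a TSV. It assumes the table has the columns sample-id, merged and non-chimeric, plus an optional '#q2:types' row; check the column names in your own export, since they may differ between QIIME 2 versions. The function name and the example numbers are made up for illustration and are not taken from the attached stats-dada2.qzv.

```python
import csv
import io


def chimeric_fraction(stats_tsv):
    """Return {sample_id: fraction of merged reads flagged as chimeric}."""
    rows = csv.DictReader(io.StringIO(stats_tsv), delimiter="\t")
    fractions = {}
    for row in rows:
        # Skip the '#q2:types' directive row if the export contains one.
        if row["sample-id"].startswith("#"):
            continue
        merged = float(row["merged"])
        non_chimeric = float(row["non-chimeric"])
        fractions[row["sample-id"]] = (
            (merged - non_chimeric) / merged if merged else 0.0
        )
    return fractions


# Made-up numbers for illustration only.
EXAMPLE = (
    "sample-id\tinput\tfiltered\tdenoised\tmerged\tnon-chimeric\n"
    "#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\n"
    "gut_01\t50000\t42000\t42000\t40000\t38000\n"
    "water_01\t50000\t41000\t41000\t40000\t22000\n"
)

# List samples from the most to the least chimeric.
ranked = sorted(chimeric_fraction(EXAMPLE).items(), key=lambda kv: -kv[1])
for sample, frac in ranked:
    print(f"{sample}\t{frac:.1%} chimeric")
```

Sorting by this fraction makes it easy to see whether the high-chimera samples cluster by sample type, such as the water samples mentioned above.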

alfanon · May 29, 2018, 7:08am

Thanks a lot @Mehrbod_Estaki and @Nicholas_Bokulich for your suggestions and feedback!
Your feedback confirmed what I suspected: the high number of chimeras spotted reflects a problem with the way some of my samples were prepared.
Just one last thing @Mehrbod_Estaki: you said that you have already experienced shorter reads (even before QC) in case of host contamination. Wouldn't those reads/merged reads be excluded anyway from the analysis after the filtering done by dada2?

Mehrbod_Estaki · May 29, 2018, 8:13am

Hi @alfanon,
Sorry, that was just poor sentence structuring on my part. I didn't mean that I had shorter reads before QC. I meant to say that I've seen denoised, merged reads from dada2 come out shorter than the expected length, and those turned out to be host contamination. The full discussion of that issue is here if you are interested in reading it, though that doesn't seem to be your problem here.
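A minimal Python sketch for spotting such shorter-than-expected merged reads, assuming the representative sequences have been exported to a FASTA file. The function name short_sequences, the 400 nt cutoff and the toy sequences are all assumptions for illustration; pick a cutoff that suits your own expected merged length.

```python
def short_sequences(fasta_text, min_len):
    """Return {feature_id: length} for FASTA records shorter than min_len."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the feature ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequences may wrap over several lines, so accumulate.
            lengths[current] += len(line)
    return {fid: n for fid, n in lengths.items() if n < min_len}


# Toy records for illustration only: one 422 nt and one 240 nt sequence.
EXAMPLE_FASTA = (
    ">feat_full\n" + "ACGT" * 105 + "AC\n"
    ">feat_short\n" + "ACGT" * 60 + "\n"
)

for feature_id, length in short_sequences(EXAMPLE_FASTA, min_len=400).items():
    print(f"{feature_id}\t{length} nt")
```

The IDs it reports are the candidates to blast, and to filter out if they turn out to be host or some other off-target sequence.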

Mehrbod_Estaki · May 29, 2018, 6:48pm

An off-topic reply has been split into a new topic: Dealing with unassigned and Kingdom-only assigned features

Please keep replies on-topic in the future.
