Dada2 filtering out >80% of reads as chimeras!

Sam_Prudence · January 17, 2019, 1:25pm

Just hoping to get some advice, after running our 16S amplicon sequencing data through dada2 it seems to be concluding >80% of the reads are chimeras, leaving us with too few reads to make any robust observations. We were already suspicious the data might have some chimera problems.. but I would like to be confident that there are in fact this proportion of chimeras before I conclude it is a bust.. any advice? See below for the code & demux.qzv file. I am running this on qiime2-2018.2.

Code-
qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza --o-table table --o-representative-sequences rep-seqs --p-trim-left-f 20 --p-trim-left-r 20 --p-trunc-len-f 120 --p-trunc-len-r 135 --p-chimera-method consensus --verbose

demux (1).qzv (290.2 KB)

colinbrislawn · January 17, 2019, 7:26pm

Hello Sam,

Thanks for providing the qzv file.

Have you taken a look at the qiime dada2 denoise-paired documentation? You could try changing some of these settings like --p-min-fold-parent-over-abundance to see how it changes your chimer detection levels.

Ah, why do you say that? Any clues you can share about the source of this issue? Like maybe high PCR cycles causing more chimeras?

Colin

P.S. Install the newest version of Qiime

Sam_Prudence · January 17, 2019, 9:43pm

Hello Colin,

Thanks for getting back to me!

I have seen the documentation yes, but I hadn't thought about changing the settings for --p-min-fold-parent-over-abundance, mostly because I am not entirely sure I understand what it does (forgive me I and quite new to this type of analysis). When it says the potential parents of a sequence being tested, does it mean the abundance of other sequences that could potentially have become fused? And would the sensible change to test be to lower the number to something like 0.5?

The sequencing service we used performed their own analysis on the data and sent this to us, we always re-do the analysis ourselves anyway, but it's still useful to look at. When I ran those OTUs through BLAST a significant proportion of the ones I tested came through as chimeras, so we were already a little worried! There is a relatively high number of PCR cycles, we sent the service a 900bp PCR product (w/ 35 cycles of PCR), and they then amplify a 120bp product to sequence. The strange thing is we have used this service previously, with the exact same primer sets and (on our end) the exact same PCR conditions, and not had any issues like this, so we are a little perplexed.

Thanks,
Sam

P.S. I will ask our cluster admins if they can update qiime2 for us!

Sam_Prudence · January 18, 2019, 4:25pm

Quick update, I have re-run the code after tweaking the --p-min-fold-parent-over-abundance (I changed it to 0.75), and that did improve the number of sequences I end up with, but I still loose a large proportion of them at the chimera detection step (still over 80%).
Is it likely I have more chimeras within those reads now that I have changed --p-min-fold-parent-over-abundance?

                         input filtered denoised merged non-chimeric

BS1.A_5_L001_R1_001.fastq.gz 144314 141104 141104 94696 11543
BS2.A_6_L001_R1_001.fastq.gz 146582 143241 143241 95915 11609
BS3.A_4_L001_R1_001.fastq.gz 154184 150785 150785 102300 12189
E1.A_3_L001_R1_001.fastq.gz 132018 129421 129421 98352 18706
E2.A_9_L001_R1_001.fastq.gz 125842 123704 123704 92818 16905
E3.A_1_L001_R1_001.fastq.gz 134513 131866 131866 100629 16497

Code-
qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza --o-table table --o-representative-sequences rep-seqs --p-trim-left-f 20 --p-trim-left-r 20 --p-trunc-len-f 120 --p-trunc-len-r 135 --p-chimera-method consensus --p-min-fold-parent-over-abundance 0.75 --verbose

colinbrislawn · January 18, 2019, 6:11pm

Hello Sam,

We are deep into the investigatory stage of this discussion. I don't think I'm going to have an 100% solid answers, but I think you are asking all the right questions and I wanted to raise a few questions of my own.

That combo of 35 PCR cycles followed by a second amplification could cause chimeras. I remember hearing a recommendation to use <25 cycles and more starting DNA in order to mitigate this issue, but I haven't done PCR myself is years. This still doesn't explain why is worked last time, but not this time.
I wonder if we are maybe (hopefully!) misreading that table. Maybe there are 94,696 denoised reads in the first sample, and these are captured in 11,543 ASVs by dada2. This huge reduction in which a smaller number of ASVs represent a much larger pool of reads is the goal of dada2. So if I'm confusing the columns, we might not have a problem at all!
It's more important to track number of reads removed, than the number of ASVs removed. I often lose many of my features to chimeras (10-30%), while only losing a small fraction of reads (1-5%). Removing lots of chimeric features is a good thing! You don't want them in your data!

Let me know what you find. Maybe we could also ask the dada2 dev to take a look at this and offer any advice.

Colin

ebolyen · January 18, 2019, 9:11pm

Hey @colinbrislawn and @Sam_Prudence !

Unfortunately, you are both reading the table correctly. Losing ~90% in a single step is a good sign that something went wrong.

If I remember correctly, DADA2 will have a harder time with chimera's if there is any non-biological sequence in the read.

Are we certain that there are no primers/adapaters in the sequence? You may also need to check for the reverse-complemented reverse-primer on the forward-strand (and vice-versa for paired-end) if your amplicon is short enough to be sequenced through the reverse adapters (common with ITS).

Sam_Prudence · January 21, 2019, 1:28pm

Hi Both,

Thanks for your responses!

That combo of 35 PCR cycles followed by a second amplification could cause chimeras. I remember hearing a recommendation to use <25 cycles and more starting DNA in order to mitigate this issue, but I haven’t done PCR myself is years. This still doesn’t explain why is worked last time, but not this time.

Yes I was worried about this! From some of our sample types we get very low yields DNA with this primer pair, hence the higher number of cycles.

Are we certain that there are no primers/adapaters in the sequence? You may also need to check for the reverse-complemented reverse-primer on the forward-strand (and vice-versa for paired-end) if your amplicon is short enough to be sequenced through the reverse adapters (common with ITS).

My primers are 15 (fw) & 17 (rev) bases, so using a --p-trim-left-f/r of 20 should be plenty to remove the primer sequence right? The amplicon is only 167 base-pairs so they could include the reverse-complemented reverse-primer. Would my --p-trunc-len exlude this by truncating at 135bp?

Sam_Prudence · January 21, 2019, 1:28pm

Quick update, I have tried to check whether the reverse complement reverse primer is present in my forward reads and I couldn't seem to find it.

(to do this I opened the forward reads fasta in notepad++, then ctrl+f, my primer sequence, it uses one degenerate base (K) so I tried both possible variants). Would this have actually worked, or is there a better way to go about this?

colinbrislawn · January 21, 2019, 6:31pm

Hello Sam,

Thanks for the update.

You could try using the cutadapt plugin which is designed to filter and trim adapters and primers.

Colin

Sam_Prudence · January 22, 2019, 9:33am

Thanks for the advice (sorry for the extra questions!), I can't really see if/how I can use cutadapt in isolation? Is it a component of demux-paired/singe & trim-paired/single? If so this dataset has been passed through demux already, do you think it is worth attempting with a trim-paired step?

thermokarst · January 22, 2019, 4:29pm

Hi @Sam_Prudence!

Yep!

Correct, try trim-paired.

:qiime2:

Sam_Prudence · January 22, 2019, 5:27pm

Thanks I will give that a shot!
Is it okay if I just ask for some advice on how to run trim-paired? Under the --p-front-f & --p-front-r options, am I supposed to give my adaptor sequence, or can I give it my PCR primers so that it removed everything upstream of them?

Is there any way of me finding out my adaptor sequence?

thermokarst · January 22, 2019, 11:31pm

I would suggest looking at the cutadapt docs --- Cutadapt — Cutadapt 5.0 documentation --- that describes their filtering semantics --- q2-cutadapt simply exposes the parameters defined by cutadapt. Both of those examples in your question are cases covered in those docs.

Not in cutadapt, to my knowledge. Tools like FastQC can help identify illumina-specific non-biological-sequence, YMMV.

ebolyen · January 23, 2019, 12:17am

The adapter shouldn't actually be necessary to know as those sequences flank your primers which you do (or at least should) know. So if you trim everything before your forward primer and everything after your RC'ed reverse primer, you will remove all non-biological sequence

Sam_Prudence · January 23, 2019, 10:25am

Hi Both,

if you trim everything before your forward primer and everything after your RC’ed reverse primer, you will remove all non-biological sequence

Great thanks! I will try this now and report back with my findings.

thermokarst · January 23, 2019, 4:21pm

Good point @ebolyen! One thing to keep in mind though @Sam_Prudence is this suggestion might not work as expected when using an anchored adapter --- check out the cutadapt docs above for more details.

Sam_Prudence · January 25, 2019, 9:32am

Hi all,

Thanks again for your replies! I have ran the sequences through cutadapt (see below for code), and then passed it through dada2 again, and I am still losing the same proportion at the chimera checking step. Is what I ran correct?

qiime cutadapt trim-paired --i-demultiplexed-sequences demux.qza --p-front-f CCTAYGGGRBGCAACAG --p-front-r GGACTACNNGGGTATCTAAT --p-error-rate 0 --o-trimmed-sequences trim-paired-seqsA.qza --verbose

Another interesting development..
I have ran both the forward and reverse reads through dada2 denoise single separately, treating them as single end reads. I submitted these as scripts to our cluster as it can take >10 hours for dada2 to process this data-set sometimes, so I don't have any details of how many reads were filtered at each step. What I found however is that the forward reads retained ~70% of the sequences, whereas I only retained ~12% of the reverse reads. I have ran some of the most abundant features through BLAST and nothing looks like a chimera! Could the issue here be something to do with my reverse reads? Or some merging problem resulting in chimeras?

lca123 · January 25, 2019, 11:09am

Maybe. I have had such a problem and it happened to be a truncating + merging issue. In my case I was truncating too much and letting a smaller than the necessary overlap (<20 bp). In addition, the overlapping region in my reads is already small, 30-40bp, se check it out if that's not the case of your reads.

Sam_Prudence · January 25, 2019, 11:40am

In addition, the overlapping region in my reads is already small, 30-40bp, se check it out if that’s not the case of your reads.

Thanks for the suggestion! I have looked into this and am now even more confused.. My amplicon is 167bp, yet somehow my reads are over ~245bp... (looking at the reads after removing adaptors/primers, no other processing). Judging by the output from previous dada2 runs (see above) the merging seemed to go okay (although if my reads are that length my p-trunc-len parameters wouldn't have left much overlap

thermokarst · January 28, 2019, 10:03pm

What did the verbose output look like? The cutadapt logs indicate how many times a subseq was removed --- that will be a good indication of how "well" it worked.