Dada2 filtering out >80% of reads as chimeras!

Quick update: I have tried to check whether the reverse complement of the reverse primer is present in my forward reads, and I couldn't seem to find it.

(To do this I opened the forward-reads FASTA in Notepad++ and used Ctrl+F to search for the primer sequence; it contains one degenerate base (K), so I tried both possible variants.) Would this have actually worked, or is there a better way to go about this?
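
(If it helps, I think the command-line equivalent of my Notepad++ search would be something like the sketch below; the sequence shown is just a placeholder, not my actual primer.)

    # Count forward reads containing the RC'ed reverse primer; writing the
    # degenerate K position as [GT] covers both variants in one search.
    # (Placeholder sequence and filename, not my real primer/data.)
    grep -E -c "ATTAGATACCC[GT]GTAGTCC" forward-reads.fasta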


Hello Sam,

Thanks for the update.

You could try using the cutadapt plugin which is designed to filter and trim adapters and primers.

Colin


Thanks for the advice (sorry for the extra questions!). I can't really see if/how I can use cutadapt in isolation. Is it a component of demux-paired/single & trim-paired/single? If so, this dataset has been passed through demux already; do you think it is worth attempting a trim-paired step?

Hi @Sam_Prudence!

Yep!

Correct, try trim-paired.

:qiime2:


Thanks, I will give that a shot!
Is it okay if I just ask for some advice on how to run trim-paired? Under the --p-front-f & --p-front-r options, am I supposed to give my adaptor sequence, or can I give my PCR primers so that it removes everything upstream of them?

Is there any way of me finding out my adaptor sequence?


I would suggest looking at the cutadapt docs (Cutadapt — Cutadapt 4.7 documentation), which describe cutadapt's filtering semantics; q2-cutadapt simply exposes the parameters defined by cutadapt. Both of the cases in your question are covered in those docs.
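
In rough terms, the mapping looks like this (my paraphrase of the cutadapt docs, not a substitute for reading them):

    # cutadapt semantics, as exposed by q2-cutadapt trim-paired:
    #   --p-front-f / --p-front-r     : 5' adapter (cutadapt -g / -G); the match
    #                                   and everything before it are removed,
    #                                   e.g. your PCR primers
    #   --p-adapter-f / --p-adapter-r : 3' adapter (cutadapt -a / -A); the match
    #                                   and everything after it are removed,
    #                                   e.g. read-through into the opposite
    #                                   primer/adapter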

Not in cutadapt, to my knowledge. Tools like FastQC can help identify Illumina-specific non-biological sequence; YMMV.


You shouldn't actually need to know the adapter, as those sequences flank your primers, which you do (or at least should) know. So if you trim everything before your forward primer and everything after your RC'ed reverse primer, you will remove all non-biological sequence :slight_smile:
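
In q2-cutadapt terms, that would look roughly like the sketch below (FWD_PRIMER/REV_PRIMER are placeholders for your actual primers, and *_RC for their reverse complements; not your exact command):

    # Trim the forward primer (and anything 5' of it) plus the RC'ed reverse
    # primer (and anything 3' of it) from the forward reads, and vice versa
    # for the reverse reads. Placeholder sequences; substitute your own.
    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences demux.qza \
      --p-front-f FWD_PRIMER \
      --p-adapter-f REV_PRIMER_RC \
      --p-front-r REV_PRIMER \
      --p-adapter-r FWD_PRIMER_RC \
      --o-trimmed-sequences trimmed-seqs.qza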


Hi Both,

if you trim everything before your forward primer and everything after your RC’ed reverse primer, you will remove all non-biological sequence :slight_smile:

Great thanks! I will try this now and report back with my findings.

Good point @ebolyen! One thing to keep in mind though, @Sam_Prudence, is that this suggestion might not work as expected when using an anchored adapter; check out the cutadapt docs above for more details.


Hi all,

Thanks again for your replies! I have run the sequences through cutadapt (see below for the code) and then passed them through DADA2 again, and I am still losing the same proportion at the chimera-checking step. Is what I ran correct?

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTAYGGGRBGCAACAG \
  --p-front-r GGACTACNNGGGTATCTAAT \
  --p-error-rate 0 \
  --o-trimmed-sequences trim-paired-seqsA.qza \
  --verbose

Another interesting development…
I have run both the forward and reverse reads through dada2 denoise-single separately, treating them as single-end reads. I submitted these as scripts to our cluster, as it can take >10 hours for DADA2 to process this dataset sometimes, so I don't have any details of how many reads were filtered at each step. What I found, however, is that the forward reads retained ~70% of the sequences, whereas I only retained ~12% of the reverse reads. I have run some of the most abundant features through BLAST and nothing looks like a chimera! Could the issue here be something to do with my reverse reads? Or some merging problem resulting in chimeras?
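
(For reference, the single-end runs were of roughly this form; the truncation value here is illustrative rather than the exact one I used:)

    # Forward reads only, treated as single-end; the reverse reads went
    # through the same command after being imported separately.
    qiime dada2 denoise-single \
      --i-demultiplexed-seqs forward-reads.qza \
      --p-trunc-len 240 \
      --o-table table-fwd.qza \
      --o-representative-sequences rep-seqs-fwd.qza \
      --o-denoising-stats stats-fwd.qza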

Maybe. I have had such a problem, and it turned out to be a truncation + merging issue. In my case I was truncating too much and leaving a smaller-than-necessary overlap (<20 bp). In addition, the overlapping region in my reads was already small (30-40 bp), so check whether that's the case for your reads.
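
As a rough sanity check, you can estimate the overlap from your truncation settings and amplicon length (illustrative numbers below; plug in your own):

    expected overlap ≈ trunc_len_f + trunc_len_r − amplicon_length
    e.g. 200 + 180 − 360 = 20 bp, right at the limit I mentioned above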

In addition, the overlapping region in my reads was already small (30-40 bp), so check whether that's the case for your reads.

Thanks for the suggestion! I have looked into this and am now even more confused... My amplicon is 167 bp, yet somehow my reads are ~245 bp (looking at the reads after removing adaptors/primers, no other processing). Judging by the output from previous DADA2 runs (see above), the merging seemed to go okay (although if my reads really are that length, my p-trunc-len parameters wouldn't have left much overlap).

What did the verbose output look like? The cutadapt logs indicate how many times a subsequence was removed; that is a good indication of how "well" it worked.


I don't remember seeing a table or anything, but it appeared to run to completion (no error messages, etc.) and took a few hours; other than that I am not entirely sure... sorry! (It was a while ago and I am afraid I did not save it!)

The output from this plugin (when run with --verbose) is quite lengthy; maybe you could re-run the command and take a peek?


Hi Ryan,

No problem, I will when I get a chance. (I'm currently progressing with annotating the forward reads, attempting to train a classifier using the SILVA database, which is proving challenging! I'll report back after this is done.)

Thanks again for all the help

Hi Sam,

In DADA2, it is recommended to use a value greater than or equal to 1 for the option "--p-min-fold-parent-over-abundance FLOAT". So can I ask why you used 0.75 for your data analysis?

I don't 100% understand the command, but please check this out: https://github.com/benjjneb/dada2/issues/602
Ben @benjjneb mentioned that a higher value (e.g. 4 or 8) could prevent false-positive chimera calls.

And also please check my post: The meaning of DADA2 command "--p-min-fold-parent-over-abundance FLOAT"

Recently I ran into a similar issue: the sequences after merging were still good, but after the chimera check I lost ~80%. I trimmed more sequence off and used "--p-min-fold-parent-over-abundance 8", and ~80% of the sequences were retained. I am now doing the taxonomy assignment and will keep updating.
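
For reference, the flag just gets passed to denoise-paired alongside everything else; roughly like this (truncation values illustrative, use your own):

    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs trimmed-seqs.qza \
      --p-trunc-len-f 240 \
      --p-trunc-len-r 200 \
      --p-min-fold-parent-over-abundance 8 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats denoising-stats.qza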


Hi Yaochun Yu,

I have the same issue, and I am trying to trim more sequences using "--p-min-fold-parent-over-abundance 8". My question is: does this increase the time required to more than one day to get the denoising stats?

Regards,
Maysa

Hi Yaochun,

Thanks for flagging this up. The main reason I altered "--p-min-fold-parent-over-abundance" was to test whether this parameter was responsible for my reads being discarded, by lowering the threshold. In the end, I did not use this option for the final analysis (we have actually since repeated this experiment and have much higher-quality sequencing coverage on some independent replicates of these samples), and I did not appear to have the same chimera issue in the second dataset.

Hello Sam,

I am afraid what you are looking at is probably what you think: PCR artifacts caused by a high number of cycles, low starting material, and possibly contamination.

Some amount of chimeras inevitably arises during PCR, and they compete with your sequences for amplification; hence the advice about high starting material and low cycle numbers.

The fact that this scheme worked the first time also points to the possibility of contamination in the working environment. I have also had times when a PCR worked once, but when I ran the same PCR again I did not get what I needed, only strange short fragments. This can be mitigated by using filter tips and sticking to general PCR guidelines, but completely removing contamination is not always possible. Again, high starting material and low cycle numbers reduce the effect of contamination.
