I have finished estimating ASVs, but when tracing back through the pipeline I noticed that some samples lost up to 90% of their raw reads. The most critical step is chimera detection, where up to 50% of sequences are lost.
I have the following questions regarding the effect of primer sequences on ASV estimation:
1.- What is the correct way to check whether primer sequences are still present in my raw FASTQ files?
Is it valid to search for the primer sequences with this code:
grep -c 'GACTAC\S\SGGGTATCTAATCC' fast.fastq
It reports the pattern in almost all sequences.
2.- If primer sequences are indeed present in raw files, how do they affect chimera detection?
I have read that primer sequences interfere with the chimera detection method used for ASVs, which is -removeBimeras-, but I do not understand exactly why.
I understand that removeBimera aligns test sequences to the two most abundant sequences; if it detects a substantial alignment (16 bases) between the test sequence and either or both parental sequences, it flags the test sequence as a chimera.
Could this mean that the higher chimera detection rate in files with primer sequences is due to a higher rate of alignment between the primer sequences of the test sequences and the parental sequences?
Great work sleuthing on all of this. I think you've found the root cause here, so my reply is mostly going to be agreeing with your analysis of things :)
You handled the degenerate nucleotides nicely in this expression, so yeah that pretty much tells you there's a problem. It doesn't fix it yet, more on that below.
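If it helps, here is a slightly stricter version of that check. It is only a sketch: it assumes standard 4-line FASTQ records (no wrapped sequences), and the four reads below are made up so the commands can be tried without real data. It counts sequence lines only and anchors the primer at the start of the read, which is where an untrimmed primer would normally sit.

```shell
# Build a tiny made-up FASTQ (4 reads); replace "$fq" with your own file in practice.
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'GACTACAGGGGTATCTAATCCTGTTTGCTCCCC' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@read2' 'GACTACCCGGGTATCTAATCCGGTTCGCTACCC' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@read3' 'TACGTAGGGTGCGAGCGTTAATCGGAATAACTG' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@read4' 'ACGTACGTGACTACTAGGGTATCTAATCCACGT' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII' \
  > "$fq"

# Total number of reads (4 lines per record).
total=$(awk 'END { print NR / 4 }' "$fq")

# Original-style check: pattern anywhere, on any line of the file.
anywhere=$(grep -c -E 'GACTAC[ACGT]{2}GGGTATCTAATCC' "$fq")

# Stricter check: sequence lines only (line 2 of each record), primer at the read start.
at_start=$(awk 'NR % 4 == 2' "$fq" | grep -c -E '^GACTAC[ACGT]{2}GGGTATCTAATCC')

echo "reads: $total, pattern anywhere: $anywhere, primer at read start: $at_start"
rm -f "$fq"
```

On the toy file, read4 carries the pattern in the middle of the read, so it is counted by the first grep but not by the anchored one.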
You've got it exactly right here. It notices that if it presumes that the primer is fragment A, then the distribution of the reads looks super wrong, unless we're looking at chimeras like A:B, A:C, A:D, etc. If you think about it, a primer technically is a PCR chimera, just a purpose built intentional one.
What you'll need to do next is trim these primer/adapter sequences using cutadapt (or honestly your favorite tool if you have one) and re-run DADA2. We have a QIIME 2 plugin called q2-cutadapt which can help, this is probably the method you want:
It's a bit involved, so you may need to refer to the cutadapt docs for complete details. As a heads up, we use the long form (--name) of the cutadapt options, whereas the docs tend to use the short form (-N), so be prepared for that.
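As a rough sketch of what such a call might look like, assuming paired-end data and the `trim-paired` method of q2-cutadapt: the file names `demux.qza` and `demux-trimmed.qza` are hypothetical, the forward-read primer is the pattern from the grep in the question with `N` at the two degenerate sites, and the reverse-read primer is a placeholder you must fill in. The snippet only assembles and prints the command so it can be checked before anything is run.

```shell
# Assumption: this primer sits at the 5' end of the forward reads.
PRIMER_F='GACTACNNGGGTATCTAATCC'
# Placeholder on purpose: replace with the primer found on your reverse reads.
PRIMER_R='REPLACE_WITH_REVERSE_READ_PRIMER'

# Assemble the command as one string (dry run; nothing is executed here).
CMD="qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux.qza \
--p-front-f $PRIMER_F \
--p-front-r $PRIMER_R \
--p-discard-untrimmed \
--o-trimmed-sequences demux-trimmed.qza"

echo "$CMD"
```

Once the printed command looks right, paste it into a shell where QIIME 2 is active. `--p-discard-untrimmed` drops reads in which no primer was found, which is a design choice you may or may not want for your data.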
Another thing to look out for will be the reverse-complemented reverse-primers on the forward read and vice-versa depending on your read lengths and amplicon of interest. It's usually worth attempting to filter those out as well, as they can flag your shorter amplicons as spuriously chimeric.
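To look for those read-through primers you first need the reverse complement of each primer. A minimal sketch, assuming `rev` and `tr` are available; the helper name `revcomp` is made up for illustration, and the `tr` mapping covers the IUPAC ambiguity codes (S, W and N are their own complements):

```shell
# Reverse-complement a primer, including IUPAC degenerate bases.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHVN' 'TGCAYRMKVHDBN'
}

# Primer pattern from the question, with N at the two degenerate sites.
rc=$(revcomp 'GACTACNNGGGTATCTAATCC')
echo "$rc"   # GGATTAGATACCCNNGTAGTC
```

The resulting sequence is what would show up near the 3' end of the opposite read when the amplicon is shorter than the read length, and it is what you would give cutadapt as a 3' adapter.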
I performed the search on the reverse reads with the corresponding primer and it gave a similar result: most sequences (above 90%) matched the primer sequence.
Thanks for the cutadapt recommendation!
I have the following questions:
1.- I was wondering if it could be valid to trim the first 20 nucleotides of both forward and reverse reads, which would remove almost all the primer sequence as well as the degenerate sites.
Is this valid/correct?
2.- I am not sure I understand why a primer is a PCR chimera D:. I regard 16S primers as consensus sequences that allow some mismatches so they align with as many microbes' 16S fragments as possible. Could you elaborate on this explanation?
I would highly recommend using a tool like cutadapt to trim out your primers! It will better handle edge cases and weird issues compared to just trimming, especially if you know your primer sequences.
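One toy illustration of such an edge case, using two made-up reads with the same 12-nt tail. The second read has a hypothetical 3-nt spacer in front of the primer (an assumption for the example, not something known about your data). The `sed` line is only a crude, exact-match stand-in for what cutadapt does; cutadapt additionally tolerates mismatches.

```shell
r1='GACTACAGGGGTATCTAATCCTGTTTGCTCCCC'      # primer starts at the first base
r2='ACTGACTACCCGGGTATCTAATCCTGTTTGCTCCCC'   # hypothetical 3-nt spacer, then primer

# Fixed trimming: drop the first 20 nt regardless of content.
fixed1=$(printf '%s\n' "$r1" | cut -c 21-)
fixed2=$(printf '%s\n' "$r2" | cut -c 21-)

# Primer-aware trimming: remove everything up to and including the primer,
# allowing up to 5 leading bases before it.
primer_re='^[ACGT]{0,5}GACTAC[ACGT]{2}GGGTATCTAATCC'
aware1=$(printf '%s\n' "$r1" | sed -E "s/$primer_re//")
aware2=$(printf '%s\n' "$r2" | sed -E "s/$primer_re//")

echo "fixed: $fixed1 $fixed2"   # CTGTTTGCTCCCC ATCCTGTTTGCTCCCC
echo "aware: $aware1 $aware2"   # TGTTTGCTCCCC TGTTTGCTCCCC
```

With fixed trimming the two reads end up with different leftover primer bases, so identical amplicons no longer look identical; primer-aware trimming leaves the same sequence for both.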
A chimera is two unrelated sequences that have been "tangled" together and sequenced as one sequence. A PCR primer is an unrelated sequence we attached to our sequences on purpose. So similarly, a PCR primer and the real sequence are two unrelated sequences.
If your PCR primer is identified as Sequence A, then every sequence that has a PCR primer is a combination of Sequence A and another sequence. Therefore all of them are being identified as chimeras.