Dada2: merging plates and chimera removal

colinbrislawn · October 16, 2021, 11:30pm

Hi, Peat! Welcome to the forums :qiime2:

I think both approaches (removing chimeras per run, or across all runs after merging) are common and accepted. This is because, in practice, the results are similar, as you noted.

I'm not sure there's a way to do this using the q2-dada2 plugin... You could do this directly with DADA2 in R, or using the QIIME 2 vsearch plugin; see vsearch uchime-denovo and uchime-ref.

While we are on this topic, it's worth mentioning why it's preferable in theory to detect and remove chimeras per run, and why it's comparable in practice to do it later on.

How are chimeras formed?

'Chimeras' are thought to be a technical artefact of PCR. From PMC3044863, Figure 1, as summarized on Wikipedia:

It occurs when the extension of an amplicon is aborted, and the aborted product functions as a primer in the next PCR cycle. The aborted product anneals to the wrong template and continues to extend, thereby synthesizing a single sequence sourced from two different templates.
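To make that concrete, here is a toy sketch of the end result: the start of one template joined to the end of another. The function name `make_chimera` and the sequences are made up for illustration, and the model assumes two aligned, equal-length parents with a single breakpoint, which is much simpler than real PCR.

```python
def make_chimera(parent_a: str, parent_b: str, breakpoint: int) -> str:
    """Join the first `breakpoint` bases of parent_a to the rest of parent_b.

    Toy model only: assumes the two parents are already aligned and have
    the same length, so one index marks where extension was aborted.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("this toy model assumes equal-length parents")
    if not 0 < breakpoint < len(parent_a):
        raise ValueError("breakpoint must fall inside the sequence")
    # The aborted product (prefix of A) primes on B and extends to the end.
    return parent_a[:breakpoint] + parent_b[breakpoint:]


# Hypothetical parent amplicons, chosen so the join is easy to see.
parent_a = "AAAAAAAAAAAA"
parent_b = "CCCCCCCCCCCC"

chimera = make_chimera(parent_a, parent_b, 5)
print(chimera)  # AAAAACCCCCCC
```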

How can you detect and remove these artificial hybrids?

Given that:

each chimera is composed from real amplicons in a sample, and
more common amplicons should cause more chimeras, and
these fake 'children' chimeras should be less abundant than their real 'parent' amplicons

Then:

You could look for less common amplicons that could be explained as a combination of more common amplicons, and label them as chimeric!
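Here's a minimal sketch of that parent-child search. To be clear, this is not how DADA2 or vsearch actually implement it: as far as I know, real tools work from alignments and tolerate mismatches, while this toy version only accepts exact prefix + suffix matches between equal-length sequences. The names `is_bimera`, `find_chimeras`, and `min_fold` are my own, and the fold cutoff of 2 is an arbitrary choice for the example.

```python
from typing import Dict, List


def is_bimera(seq: str, parents: List[str]) -> bool:
    """True if seq is exactly a prefix of one parent plus a suffix of another."""
    for left in parents:
        for right in parents:
            if left == right:
                continue
            # Try every possible breakpoint inside the sequence.
            for i in range(1, len(seq)):
                if seq[:i] == left[:i] and seq[i:] == right[i:]:
                    return True
    return False


def find_chimeras(counts: Dict[str, int], min_fold: float = 2.0) -> List[str]:
    """Flag sequences explained by two parents that are more abundant.

    `min_fold` (hypothetical parameter) is how many times more abundant
    a sequence must be before it can act as a parent.
    """
    chimeras = []
    for seq, n in counts.items():
        parents = [
            p for p, m in counts.items()
            if p != seq and len(p) == len(seq) and m >= min_fold * n
        ]
        if is_bimera(seq, parents):
            chimeras.append(seq)
    return chimeras


# Made-up counts for one sample: two abundant parents, one rare hybrid,
# and one rare sequence that is not a hybrid of anything.
counts = {
    "AAAAAAAAAAAA": 1000,
    "CCCCCCCCCCCC": 800,
    "AAAAACCCCCCC": 20,
    "GGGGGGGGGGGG": 15,
}

flagged = find_chimeras(counts)
print(flagged)  # ['AAAAACCCCCCC']
```

Note that the rare `GGGGGGGGGGGG` sequence is kept: being rare is not enough, it also has to be explainable by two more abundant parents.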

When is the best time to find (and remove) chimeric reads?

After dereplicating each sample separately: PCR is performed separately on each sample, and causes chimera formation separately in each sample, so you could find and remove chimeras from each sample! (I don't think any pipelines do this, because...)
After denoising each sample separately: same logic as above, but now we have removed noisy reads for a smaller data set and faster chimera finding!
After denoising all samples on a single run: because the same features are often the most abundant across samples, and we are just looking for less abundant 'children' from the most abundant 'parents', we might as well do this just once for each run.
After denoising and merging all feature tables: same logic as above, but now we only have to do this parent-child search once in the whole pipeline. #YOCCO
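As a rough illustration of the first and last options, the sketch below runs the same toy parent-child search once per sample and once on the pooled counts. Everything here is hypothetical: the sample names, the counts, the `flag_chimeras` helper, and the exact-match rule (real tools are more forgiving). The toy data was picked so that both routes agree; that is not evidence that they agree on real data.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def is_bimera(seq: str, parents: Iterable[str]) -> bool:
    """Exact-match toy rule: prefix of one parent + suffix of another."""
    parents = list(parents)
    return any(
        seq[:i] == left[:i] and seq[i:] == right[i:]
        for left in parents
        for right in parents
        if left != right
        for i in range(1, len(seq))
    )


def flag_chimeras(counts: Dict[str, int], min_fold: float = 2.0) -> Set[str]:
    """Flag sequences explained by two parents at least min_fold more abundant."""
    flagged = set()
    for seq, n in counts.items():
        parents = [
            p for p, m in counts.items()
            if p != seq and len(p) == len(seq) and m >= min_fold * n
        ]
        if is_bimera(seq, parents):
            flagged.add(seq)
    return flagged


A = "AAAAAAAAAAAA"
C = "CCCCCCCCCCCC"
X = "AAAAACCCCCCC"  # hybrid of A and C

# Made-up per-sample feature counts.
table = {
    "sample1": {A: 500, C: 400, X: 10},
    "sample2": {A: 300, C: 600},
}

# Option 1: search each sample on its own.
per_sample = {name: flag_chimeras(c) for name, c in table.items()}

# Option 4: sum counts across samples, then search once.
pooled = Counter()
for sample_counts in table.values():
    pooled.update(sample_counts)  # Counter.update adds counts
pooled_flags = flag_chimeras(pooled)

print(per_sample["sample1"] == {X}, per_sample["sample2"] == set())  # True True
print(pooled_flags == {X})  # True
```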

I hope that helps, but if that raises more questions than answers, let's keep this discussion going!

Colin

P.S. After listing those options, I'm starting to think that we should be doing the chimera search earlier in our pipelines. Has anyone tried this using modern ASV / denoising methods and shown that the results are identical? I can't find a citation that uses ASVs...