I have V3V4 region 2x250 bp paired-end sequencing data from mouse feces. I've checked to make sure my reads do not contain any non-biological sequence, and based on the QC plots, I chose to trim the first 5 bases of each read and truncate at 249 bp. I ran the following dada2 command, which resulted in the loss of about 80% of the merged reads due to chimera detection. The other metrics look pretty good, so it would seem to be a problem with the last step.
Running the same command but without trimming the first 5 bases resulted in significantly less loss to chimera detection. Even though it is significantly better, I'm still losing more sequences to chimera detection than I expect to, given similar questions on the forum. Is there anything I'm doing wrong, and is there a way to increase yield?
I believe that all the non-biological sequence has been removed, but I left the biological portion of the primers. The forward reads start with ‘CCTACGGGNGGCWGCAG’ and the reverse reads start with ‘GACTACHVGGGTATCTAATCC’, which, based on my understanding, are the biological portions of the Illumina MiSeq V3V4 primers. What’s odd is that while other people report that removing the primers improved results, in my case it worsens chimera filtering.
Is there a good way to determine if the min-fold-parent-over-abundance default setting is non-optimal for a sample? Is standalone dada2 the only way to manually inspect chimera detection? How do you determine if the sequences being filtered aren’t actually chimeras? Recovering more reads isn’t worth introducing a bunch of chimeras.
The primers should be removed anyway (as recommended by the dada2 developers, not our policy!), but it’s strange that removing them worsens chimera filtering. It is probably because trimming alters the number of unique features and their abundances, and hence the fold-parent-over-abundance ratios.
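To make the abundance point concrete, here is a rough sketch of the parent-eligibility idea behind min-fold-parent-over-abundance: a sequence is only considered as a possible parent of a query if it is sufficiently more abundant than the query. This is not dada2's chimera detection, only the abundance filter as I understand it; the function name, the strict `>` comparison, and the counts are all assumptions made up for illustration.

```python
def eligible_parents(abundances, query, min_fold):
    """Return the sequences abundant enough to be considered parents of `query`.

    Assumption: a candidate parent must be more than `min_fold` times as
    abundant as the query. Whether the boundary is strict is a guess here.
    """
    query_count = abundances[query]
    return sorted(
        seq for seq, count in abundances.items()
        if seq != query and count > min_fold * query_count
    )

# Made-up feature counts; "Q" is the sequence being tested as chimeric.
counts = {"A": 1000, "B": 400, "C": 150, "Q": 100}

for fold in (1.0, 2.0, 8.0):
    print(fold, eligible_parents(counts, "Q", fold))
```

Raising the fold value shrinks the pool of possible parents, so fewer sequences can be flagged. Anything that shifts the counts (such as different trimming) changes which parents qualify.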
I am not sure that there is an easy way to determine what is “correct” unless you are using simulated/mock community datasets to optimize this. The dada2 developer has recommended on this forum to adjust that parameter if too many reads are being removed as chimeras, so you should just adjust it and look at the dada2-stats output to see how this alters the non-chimeric read yield.
No, chimera counts are reported in the dada2-stats output generated by q2-dada2, as you have seen. But if you mean inspecting the chimeric seqs themselves, then no, this is not something q2-dada2 allows.
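When comparing runs with different settings, something like the sketch below can summarize how much is lost at the chimera step per sample. It assumes the exported stats table has `sample-id`, `merged`, and `non-chimeric` columns (check the header of your own file), and the counts shown are made up.

```python
def chimera_loss(rows):
    """Return {sample_id: fraction of merged reads removed as chimeric}."""
    losses = {}
    for row in rows:
        merged = int(row["merged"])
        non_chimeric = int(row["non-chimeric"])
        # Guard against samples with no merged reads.
        losses[row["sample-id"]] = (
            (merged - non_chimeric) / merged if merged else 0.0
        )
    return losses

# Made-up rows standing in for a parsed stats table (assumed column names).
example_rows = [
    {"sample-id": "S1", "merged": "10000", "non-chimeric": "2000"},
    {"sample-id": "S2", "merged": "8000", "non-chimeric": "6800"},
]

for sample, loss in chimera_loss(example_rows).items():
    print(f"{sample}: {loss:.0%} of merged reads removed as chimeric")
```

Rows in this shape can be read from a tab-separated export with `csv.DictReader(handle, delimiter="\t")`, skipping any comment or type-directive lines first.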
Using simulated and/or mock community data! Quite a process to really optimize.
Amen! But 75% plus is probably false-positive detection… 10-20% is probably a more reasonable “normal” average to aim for (unless you are expecting high amounts because of some characteristics of your samples and/or library prep protocol).
Sorry I can’t give more specific guidance… maybe @benjjneb or others have other ideas for optimizing this on data with unknown compositions.
Nothing to add except to reiterate that you must remove the primers. The entire primers, not just 5 bp. The primers aren’t sequences from the sample, they are sequences that were added into the PCR reaction, and the ambiguous nucleotides in the primers interfere with denoising and chimera detection.
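As a quick sanity check before choosing trim lengths, a sketch like the following counts the primer lengths and tests whether a read really begins with the degenerate primer. The primer sequences are the ones quoted in the question; the helper name and the example read are hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def starts_with_primer(read, primer):
    """Return True if `read` begins with the (possibly degenerate) primer."""
    if len(read) < len(primer):
        return False
    return all(base in IUPAC[code] for base, code in zip(read, primer))

FWD_PRIMER = "CCTACGGGNGGCWGCAG"
REV_PRIMER = "GACTACHVGGGTATCTAATCC"

print(len(FWD_PRIMER), len(REV_PRIMER))  # 17 21

# Hypothetical forward read, for illustration only.
read = "CCTACGGGAGGCAGCAGTGGGGAATATTG"
print(starts_with_primer(read, FWD_PRIMER))  # True
```

If the primers sit at the very start of every read, these lengths are what you would pass to q2-dada2 as `--p-trim-left-f 17` and `--p-trim-left-r 21`, rather than 5.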