When running my samples through the QIIME 2 pipeline, I noticed a large loss of reads for certain samples following denoising/clustering in DADA2. To get a better idea of what was going on, I ran my samples through the standalone version of DADA2. The samples that had previously lost a lot of reads were losing most of them at the chimera-removal step. Ben Callahan suggested that this may be due to primer contamination. However, my libraries were prepared with the EMP 16S V4 primers, and posts on this forum state that under the EMP protocol the primers are not sequenced.
I decided to follow the DADA2 guidelines for removing primer contamination with cutadapt. This found substantial contamination by the forward and reverse primers in reverse-complement orientation. After primer removal I repeated the DADA2 workflow, but found similar levels of read loss in the problematic samples. Now, however, the reads are being lost at the filtering/trimming and read-merging steps rather than at chimera removal.
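For reference, this is the style of cutadapt call I used, modelled on the `system2()` pattern in the DADA2 ITS tutorial. The primer sequences below are the standard EMP 515F/806R ones, and the file names and cutadapt path are placeholders, so treat this as a sketch rather than my exact command:

```r
library(dada2)

# Standard EMP 16S V4 primers (515F/806R) -- check these match your protocol
FWD <- "GTGYCAGCMGCCGCGGTAA"
REV <- "GGACTACNVGGGTWTCTAAT"

# Read-through contamination: when the insert is short, the reverse complement
# of the opposite primer appears at the 3' end of each read. File names and
# the cutadapt path are placeholders.
cutadapt <- "/usr/local/bin/cutadapt"
system2(cutadapt, args = c(
  "-a", dada2::rc(REV),   # RC of reverse primer at 3' end of forward reads
  "-A", dada2::rc(FWD),   # RC of forward primer at 3' end of reverse reads
  "-n", 2,                # allow up to two removal rounds per read
  "-o", "trimmed_R1.fastq.gz", "-p", "trimmed_R2.fastq.gz",
  "raw_R1.fastq.gz", "raw_R2.fastq.gz"))
```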
I am unsure how to proceed now and have been circling around this issue for a while. Has anyone else come across something like this? What could be causing the primer contamination? Should I use the samples that show high read loss or will they be hopelessly biased in their microbiome profile? Would there be a way to retain more reads for these samples?
Did you use the full EMP protocol for library preparation, or only the V4 primers? E.g., did the primers have barcodes and Illumina adapters attached, or did you use a follow-up library-preparation step?
This sounds like progress. It is a new problem, but one that is easier to address, e.g., by adjusting the trimming parameters to remove low-quality sections.
If the reads are being lost during merging, then this will introduce a significant bias. Otherwise the number of reads remaining is probably more important than the number of reads lost for assessing whether these samples are suitable (e.g., use alpha rarefaction to see if you still achieve a reasonable level of sequencing depth).
The library prep followed the EMP protocol: only a single PCR step, with barcodes (forward only) and Illumina adapters attached.
So to optimise from here, my best strategy would be to remove the low-quality sections of the reads while still retaining as many reads as possible at the read-merging stage of DADA2? What settings would people recommend altering to do this?
Right now, I have used Figaro to optimise my trimming/filtering parameters: truncLen is set to 187 (forward) and 133 (reverse), maxEE to c(1, 1), maxN to 0 and truncQ to 2. However, these values were optimised with the primer contamination still present in the data, so they may no longer be appropriate.
Should I perhaps set truncLen to 0 (the default) and instead use a more stringent truncQ value (10 or 20?)?
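For concreteness, my current filterAndTrim call looks like this (file names are placeholders); the alternative I am asking about would swap the fixed truncLen for truncLen = 0 plus a higher truncQ:

```r
library(dada2)

# Current Figaro-derived settings (input/output names are placeholders)
out <- filterAndTrim("trimmed_R1.fastq.gz", "filt_R1.fastq.gz",
                     "trimmed_R2.fastq.gz", "filt_R2.fastq.gz",
                     truncLen = c(187, 133),   # forward, reverse
                     maxEE = c(1, 1), maxN = 0, truncQ = 2,
                     compress = TRUE, multithread = TRUE)

# Possible alternative: no fixed truncation, stricter quality-based truncation
# out <- filterAndTrim(..., truncLen = 0, truncQ = 10,
#                      maxEE = c(1, 1), maxN = 0)
```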
I will check whether alpha rarefaction can be done in QIIME 2.
Yes, truncQ is fine since you have paired-end reads, but otherwise you could trim at a defined position based on the average quality profiles. It sounds like you are having issues merging, so more stringent truncation will hurt, not help.
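To pick truncation positions from the average quality profiles, dada2's plotQualityProfile is the usual tool. One sanity check worth doing before truncating harder: the 515F/806R V4 amplicon is roughly 253 bp, and mergePairs() requires 12 nt of overlap by default, so the forward plus reverse truncation lengths should stay comfortably above ~265 nt (187 + 133 = 320 currently leaves margin, which is why heavier truncation risks killing the merge). A sketch, with placeholder file names:

```r
library(dada2)

# Inspect averaged per-cycle quality to choose truncation positions by eye
# (file names are placeholders)
plotQualityProfile(c("trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz"))

# Sanity check that the chosen truncation lengths still allow merging
trunc_f <- 187; trunc_r <- 133
amplicon_len <- 253   # approximate 515F/806R V4 amplicon length
min_overlap <- 12     # mergePairs() default
stopifnot(trunc_f + trunc_r >= amplicon_len + min_overlap)
```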