I have a best-practices question about how to determine how much trimming to apply in DADA2 to get the best amplicon reconstruction. I recently ran some MiSeq V4-region paired-end data (2x250) through QIIME 2. Before starting, I analyzed the data with FastQC, looking for any sign of barcode or adapter left in the sequences. I also did a simple regex search for the 2x8bp barcode sequences that were used. FastQC reported no overrepresented sequences and my regex turned up no matches. This data was produced by our center’s production group, which is usually good about cleaning the data before handing it off to analysts, so at this point I assumed that no trimming was necessary and ran DADA2 with both the trimming and truncation settings at 0. I should mention that I also checked the per-base read quality in the FastQC results, and it was good for the entire 250bp. The aggregate quality plot in the QIIME 2 import visualization agreed: there was a bit of a drop at the end, but quality was very high overall.
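To make the regex check concrete, a search of this kind could look like the sketch below. It is an illustration only, not the exact script that was run: the function names, the 20bp window, and the idea of passing reads in as plain strings are all assumptions. It also checks reverse complements, since a barcode can show up in either orientation.

```python
import re

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_barcode_hits(reads, barcodes, window=20):
    """Count reads with any barcode in their first or last `window` bases.

    Each barcode is searched in both forward and reverse-complement
    orientation. `window` is a hypothetical parameter for this sketch.
    """
    patterns = set()
    for barcode in barcodes:
        barcode = barcode.upper()
        patterns.add(barcode)
        patterns.add(reverse_complement(barcode))
    regex = re.compile("|".join(sorted(re.escape(p) for p in patterns)))

    hits = 0
    for read in reads:
        read = read.upper()
        if regex.search(read[:window]) or regex.search(read[-window:]):
            hits += 1
    return hits
```

Note that an exact-match search like this will miss barcodes that contain sequencing errors or that sit outside the chosen window, so zero hits does not prove the reads are clean.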
This attempt (with no trimming or truncation) had problems merging. There was a high chimera rate (around 50% of reads lost as chimeric) and the taxonomic classification showed very low diversity (only 8-20 taxa per sample). The most telling sign of a problem came when I dumped the OTU table: every OTU was unique to a single sample, with no overlap of OTUs between samples. That made it seem pretty clear that some sample-specific barcode was still left in these reads.
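The overlap check on the OTU table can be sketched in a few lines. This assumes the table has already been exported into a plain mapping of sample ID to per-OTU counts; the function name and the data layout are hypothetical and not part of QIIME 2.

```python
from collections import Counter


def shared_feature_fraction(table):
    """Fraction of observed OTUs that occur in two or more samples.

    `table` maps sample ID -> {OTU ID: count}. An OTU counts as present
    in a sample when its count is greater than zero.
    """
    samples_per_otu = Counter()
    for counts in table.values():
        for otu, n in counts.items():
            if n > 0:
                samples_per_otu[otu] += 1
    if not samples_per_otu:
        return 0.0
    shared = sum(1 for k in samples_per_otu.values() if k >= 2)
    return shared / len(samples_per_otu)
```

A result of 0.0 across many samples from similar material corresponds to the symptom described above, where no OTU appears in more than one sample.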
So I went back and trimmed/truncated 20bp off both ends, enough to ensure that the 2x8bp barcodes would be removed from either end. After that, the merging was great (~15% or fewer reads lost as chimeric), the taxonomic classifications were good (80-100+ taxa per sample), and the OTUs had significant overlap between samples, as we expected.
So my question is: what should I be using to screen my reads when deciding how much to trim? I thought I was being careful by checking with FastQC and using a regex to search for the barcode sequences, but whatever was in there was not detected by either method. My goal is a robust method for determining the minimum amount I need to trim to clean my data and get good amplicon reconstruction. Any advice would be appreciated!
Thanks,
John Martin