Finding barcodes and adapters in amplicon sequence (to determine how much to trim in DADA2)

I have a best-practices question about how to decide how much trimming to apply in DADA2 to get the best amplicon reconstruction. I recently pushed some MiSeq V4 region paired-end data (2x250) through QIIME 2. Before starting, I analyzed the data with FastQC, looking for any sign of barcode or adapter left in the sequences. I also did a simple regex search for the 2x8bp barcode sequences that were used. FastQC reported no overrepresented sequences and my regex turned up no matches. The data was produced by our center’s production group, which is usually good about cleaning all data before handing it off to analysts, so at this point I assumed that no trimming was necessary and ran DADA2 with both the trimming and truncation settings at 0. I should mention that I also checked the per-base read quality in the FastQC results, and it was good for the entire 250bp. The qiime import visualization of the aggregate quality agreed: there was a bit of a drop at the end, but overall the quality was very high.

This attempt (with no trimming or truncation) had some problems merging. There was a high chimera rate (around 50% of reads lost as chimeric) and the taxonomic classification showed very low diversity (only 8-20 taxa per sample). The most telling sign of a problem was that, when I dumped the OTU table, all my OTUs were unique to a single sample: there was no overlap of OTUs between samples. This made it seem pretty clear that some sample-specific barcode must be left in these reads.

So I went back and trimmed/truncated 20bp off both ends, enough to ensure that the 2x8bp barcodes would be removed from either end. After that, the merging was great (~15% or fewer chimeras), the taxonomic classifications were good (80-100+ taxa per sample), and the OTUs had significant overlap between samples, as we expected.

So my question here is what should I be using to screen my reads when making a decision of how much to trim? I had thought I was being careful by checking with FastQC and using a regex to search for the barcode sequences. But whatever was in there was not being seen by either of those methods. My goal is to have a robust method I can apply to determine the minimum amount I need to trim to clean my data and get good amplicon reconstruction. Any advice would be appreciated!

Thanks,
John Martin

Hi @jmartin

Personally, I would inspect one of the raw sequence files by eye rather than relying on any software.

If you know the library was prepared with 2x8bp barcodes and sequenced in PE250 mode, there should be an 8bp barcode at the start of your R1 sequences, followed by the forward primer, and another 8bp barcode at the start of your R2 sequences, followed by the reverse primer.
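One way to check the layout described above without eyeballing every read is to record where the primer starts in each read. The following is a minimal Python sketch, not a replacement for cutadapt: the function names are hypothetical, the forward primer shown is an assumption (a commonly used V4 515F sequence; substitute the primers actually used), the example reads are made up, and matching is exact apart from IUPAC degeneracy, so reads with sequencing errors in the primer are reported as unmatched.

```python
import re
from collections import Counter

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def primer_offsets(reads, primer, window=40):
    """Count the 0-based position at which the primer starts in each read.

    Only the first `window` bases are searched; reads without a match
    are counted under the key None.
    """
    pattern = primer_regex(primer)
    offsets = Counter()
    for read in reads:
        match = pattern.search(read.upper()[:window])
        offsets[match.start() if match else None] += 1
    return offsets


# Assumed forward primer; replace with the one used for your library.
PRIMER_F = "GTGCCAGCMGCCGCGGTAA"

# Made-up reads: 8 bp barcode + primer + a short invented insert.
reads = [
    "ACGTACGT" + "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGGT",
    "TTGACCAT" + "GTGCCAGCCGCCGCGGTAA" + "TACGTAGGGT",
    "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGGT",  # barcode already removed
    "TACGGAGGGTGCAAGCGTTAATCGGAATTA",       # no primer at all
]

for offset, count in primer_offsets(reads, PRIMER_F).items():
    print(offset, count)
```

If most reads report the primer at offset 8, the 8bp barcode is probably still attached; an offset of 0 suggests the barcode was already removed, and a large share of unmatched reads suggests the primers were stripped too (or that the wrong primer was searched for).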

You can also check whether the barcode is still there from the read length. Illumina PE250 yields 251bp raw sequences in both R1 and R2, so if the 8bp barcodes had already been removed you would probably have received 243bp sequences in each read. If the barcode is still there, try q2-cutadapt, or just set --p-trim-left-f and --p-trim-left-r to 8 in q2-dada2.
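The length check can be scripted. Below is a small Python sketch that tallies read lengths; it assumes standard four-line FASTQ records (as Illumina produces), the function names are hypothetical, and the in-memory records are invented so the example runs without a file.

```python
import gzip
import io
from collections import Counter


def read_lengths(handle):
    """Tally sequence lengths from a 4-line-per-record FASTQ text handle."""
    lengths = Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the sequence is the 2nd line of each record
            lengths[len(line.strip())] += 1
    return lengths


def read_lengths_from_file(path):
    """Open a plain or gzip-compressed FASTQ file and tally read lengths."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return read_lengths(handle)


# Invented records: two 251 bp reads and one 243 bp read.
records = [("read1", 251), ("read2", 251), ("read3", 243)]
example = io.StringIO(
    "".join(f"@{name}\n{'A' * n}\n+\n{'I' * n}\n" for name, n in records)
)

for length, count in sorted(read_lengths(example).items()):
    print(length, count)
```

For real data, call `read_lengths_from_file` on one R1 and one R2 file. A single peak at 251 points to untouched raw reads, while a peak at 243 would be consistent with 8bp having been removed already.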

Also, truncate low-quality bases as much as you can in q2-dada2, because 16S V4 is only about 250bp and DADA2 requires only 12bp of overlap to merge reads. MiSeq base quality is not as good as that of newer platforms such as NovaSeq.
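To see how far truncation can go before merging fails, the overlap can be estimated with simple arithmetic: the forward and reverse reads together must cover the whole sequenced fragment plus the minimum overlap. The Python sketch below uses hypothetical function names and assumes a fragment made of 8bp barcode + 19bp primer + roughly 250bp V4 region + 20bp primer + 8bp barcode; the primer lengths are assumptions, and real V4 lengths vary by taxon, so leave some margin.

```python
MIN_OVERLAP = 12  # minimum overlap quoted in the thread


def expected_overlap(trunc_len_f, trunc_len_r, fragment_len):
    """Bases shared by the truncated forward and reverse reads.

    fragment_len is the full sequenced fragment, counted from the first
    base of R1 to the first base of R2 (barcodes and primers included).
    Trimming from the 5' ends shortens the merged sequence but does not
    change this overlap.
    """
    return trunc_len_f + trunc_len_r - fragment_len


def can_merge(trunc_len_f, trunc_len_r, fragment_len, min_overlap=MIN_OVERLAP):
    """True if the truncated reads still overlap by at least min_overlap."""
    return expected_overlap(trunc_len_f, trunc_len_r, fragment_len) >= min_overlap


# Assumed layout: barcode + primer + V4 region + primer + barcode.
FRAGMENT_LEN = 8 + 19 + 250 + 20 + 8

for trunc_f, trunc_r in [(250, 250), (200, 150), (160, 150)]:
    overlap = expected_overlap(trunc_f, trunc_r, FRAGMENT_LEN)
    print(trunc_f, trunc_r, overlap, can_merge(trunc_f, trunc_r, FRAGMENT_LEN))
```

Under these assumptions the two truncation lengths must sum to at least 317 for the reads to merge, which leaves a lot of room to cut low-quality tails from 2x250 reads.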


Hi @jmartin,
Just wanted to add one thing to @sixvable’s fantastic answers.
Something I like to do in every run before denoising is to run q2-cutadapt trim-paired, set to search for my primers, with the --p-discard-untrimmed flag on. This does a couple of things I like: a) It gets rid of my primers and everything that comes before them, so I know there is no chance of any non-biological sequence being left in my reads (including, in your case, primers). Getting rid of primers is especially important when they are degenerate; otherwise they can interfere with the DADA2 error model and inflate the number of ASVs called. b) It discards any reads that don’t contain our primers, which makes sense to me, because those reads are very likely an artifact of sequencing, bleed-through, or some other weird contamination.
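To make the logic of that step concrete, here is a rough Python sketch of what trimming through the primer and discarding untrimmed pairs amounts to. It is an illustration only, not how cutadapt is implemented: cutadapt tolerates mismatches and also trims the quality strings, whereas this sketch matches exactly (apart from IUPAC degeneracy) and handles sequences only. The function names, primer sequences, and reads are assumptions for the example.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_pair(read_f, read_r, primer_f, primer_r):
    """Remove everything up to and including the primer from each read.

    Returns the trimmed (forward, reverse) pair, or None when either read
    lacks its primer, which mimics discarding untrimmed pairs.
    """
    match_f = primer_regex(primer_f).search(read_f.upper())
    match_r = primer_regex(primer_r).search(read_r.upper())
    if match_f is None or match_r is None:
        return None
    return read_f[match_f.end():], read_r[match_r.end():]


# Assumed V4 primers; replace with the ones used for your library.
PRIMER_F = "GTGCCAGCMGCCGCGGTAA"
PRIMER_R = "GGACTACHVGGGTWTCTAAT"

# Made-up pairs: the first carries barcode + primer on both reads,
# the second has no forward primer and is therefore dropped.
pairs = [
    ("ACGTACGT" + "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGGT",
     "TGCATGCA" + "GGACTACAGGGGTATCTAAT" + "CCTGTTTGCT"),
    ("TACGGAGGGTGCAAGCG",
     "TGCATGCA" + "GGACTACAGGGGTATCTAAT" + "CCTGTTTGCT"),
]

kept = [trimmed for trimmed in (trim_pair(f, r, PRIMER_F, PRIMER_R) for f, r in pairs) if trimmed]
print(len(kept), "of", len(pairs), "pairs kept")
print(kept)
```

Because the trim position is anchored on the primer itself, this approach removes the barcode regardless of its length, so there is no need to guess a fixed number of bases to cut.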
