Best workflow for samples with high off-target (host) amplification

owlpen · June 27, 2025, 6:13pm

Hi all,
just wondering if anyone has experienced anything similar. At present I'm wishing I had a time machine and could go back and select a different region of 16S to target.

I'm working with boar semen and the sequencing company reported multiple peaks in tapestation assay for a large amount of my samples. This has not been reported to me before - but on this occasion I'm using a different company because I wanted to try out sequencing on a MiSeq rather than Novaseq, and this company also provides all intermediate data whereas the previous one treated everything as proprietary and wouldn't share data except final raw reads.

The issue is now that, as QC suggested, many samples contain large proportion of reads that align to the boar genome. So, what I'm wondering is what the best path is from here in terms of workflow. In order to get the best possible alignment results with the boar genome, I performed quality trimming with Trimmomatic plus removal of adapters and polyA/polyG tails with bbmap.

I'm wondering if, as the reads are already trimmed, that this will interfere with the dada2 denoising. My attempts with dada2 has seen many reads either lost at initial filtering step if I am not aggressive enough with filtering of the reverse read, but the main issue is that I'm losing most reads at chimeric stage even now post removal of host reads.

Deblur is performing much better for me, and for now I'm going to proceed with this analysis, but I was just wondering if anyone else has ever experienced anything similar or has any wisdom to impart to a beleaguered PhD student with 3 months left to finish everything.
Sigh...

colinbrislawn · June 27, 2025, 6:27pm

Hello Richard,

I've been there. Switching vendors can be hard, and also very necessary.

Out of all the challenges, the core problem is surprising itself!

The whole point of targeted amplicon sequencing is that the PCR primers are specific enough that off-target effects are minimized. To get lots of hits to a mammal from 16S primers is surprising!

What primers did they use? Is this a order mixup with the vendor and they gave you shotrun / untargeted /PCR-free sequencing service?

In order to get the best possible alignment results with the boar genome, I performed quality trimming with Trimmomatic plus removal of adapters and polyA/polyG tails with bbmap.

That sounds like a metagenomic method to me!

SoilRotifer · June 27, 2025, 7:39pm

Hi @owlpen,

I would not be surprised that you are amplifying mostly boar host DNA given that you are working with boar semen. Primers, in PCR, often amplify whatever they can, if there is nothing better to bind to... I've had a case in which, the V4 16S rRNA primers amplified 12S rRNA gene sequences, quite well, some samples were dominated by them and hardly any 16S rRNA etc...

Given my experience I noted above. I wonder how many of these mapped reads are actually bacteria? Probably not many, but could be worth investigating. Also, what are you using as your reference database to assign taxonomy? I would also like to know what variable region you are targeting.

It shouldn't. At least you should be preparing your data so that only the amplicon sequence is present. That is, the PCR primers and adapters, etc.. should be removed prior to denoising. The common approach is to use cutadapt to remove the PCR primers from the reads (if the sequencing protocol sequences through the primers). Then send to DADA2, etc...

If you had a time machine :

Often, to mitigate off-target amplification, you can make use of touch-down PCR. that is, you can start with an annealing temperature that is ~ 8-10 C higher than your target. Then drop 1 C each cycle until you hit your target. Then go for another 20 cycles or so. The idea is that your ideal targets are more likely to preferentially bind to your primers at higher temperatures, thereby pre-amplifying those targets. Reducing off-targets. This does not always work...
The more common thing to do, if you are unable to change primers, is to use blocking primers or peptide nucleic acids to block host DNA amplification. See this post.

owlpen · June 28, 2025, 2:10pm

Hi @SoilRotifer ,

thanks for the detailed response.

Honestly, it's really good to hear that you've experienced this issue before and that I'm not the only one. Some of my reads look fine, and these correspond with the tapestation assay results, so hopefully I will have enough data of some sort in there - it is all grist for the mill in the end!

I am going to be classifying using the pre-trained classifiers supplied here - V3-V4. I've used the RESCRIPT plug-in before, and it was v useful. I definitely want to explore every aspect of these boar aligned reads. At present I have used bowtie2 and samtools to align and then sort the unmapped reads and was going to use these going forward with Qiime. However, I will also be investigating those mapped reads further and seeing if there is a common region(s) of the boar genome that is cropping up. It's definitely worth seeing if any of those align to anything in the rRNA databases too.

Yes - as in my reply to Colin above, I've attempted to do this with cutadapt, as much as I'm able to until I get those exact primer sequences. It was the deblur stats, particularly those high fractions of reads that missed the reference that prompted me to explore this in more detail.

Your points about touchdown PCR or blocking primers, both of which are new concepts to me, has sent me off on an investigation into these - so thanks! It makes me wonder if I have traded experience for transparency in the sequencing provider. I will get in touch with the old company to ask them if they use any of those techniques and hope that they are forthcoming.

As always, thanks so much for the responses. This is a marvellous community and I appreciate your time and interest.

All best,
Richard

owlpen · June 28, 2025, 2:10pm

Hi Colin,
thanks for the response.

The primers are supposed to be amplifying the V3-V4 region of 16S. It is a surprising result to me, in that I've requested this region before, in my first ever pilot study of similar samples and this wasn't an issue then.

Looking back at my dada2 and deblur denoising stats from that run, I didn't have any problem with high percentage of chimeric reads, nor did I see that a high fraction of reads missed the Greengenes reference in the Deblur stats. Both of these issues are occurring this time. The main issue on that occasion, was that the overlap wasn't big enough as the NovaSeq only generated 250bp reads.

This paper Non-specific amplification of human DNA is a major challenge for 16S rRNA gene sequence analysis had alerted me to a potential issue when I move to human samples, but as I said, given my previous experience I wasn't expecting to see this with boar, and certainly not to this degree.

You are right, this is definitely a shotgun workflow, but I'm just wanting to be sure that I don't end up falsely classifying boar reads as bacterial, so am at the experimentation stage before deciding on a specific workflow from here. In my work so far, I've found that semen is low in microbial biomass, although not as low as some niches, due to passage through the urethra. However, I've done some work with qPCR where I've tried to calculate the ratio of host to bacterial DNA in samples (as part of optimising a host depletion methodology), and the ratio is about 99:1!

I am awaiting final confirmation on which specific primers were used. Slightly annoyingly this information wasn't provided to me on release of my data, despite these not being trimmed. At present I am just trimming the first 20 bases from the 5' ends until I get confirmation of sequences.

Lastly, I'm pretty sure this is not a mix up as my order contains some shotgun and some RNAseq samples too - part of a methodological comparison.

Many thanks,
Richard

owlpen · July 4, 2025, 8:24pm

Hi, a quick update from me.

My primer sequences were provided to me this Weds. I went back and ran the reads through cutadapt and then performed the same bowtie2 alignment.
The proportion of reads where primers are untrimmed correlates with the proportion of host genome alignment - so everything is pointing towards off target amplification.
So, ultimately I have some samples with v low sequencing depth, which I will probably remove downstream, but still some good ones to work with.
Dada2 now works very well - I lose most reads at initial Q filtering, and very few to merging and chimeras. So all in all, feeling happy about that.

Despite these issues, the MiSeq platform (and potentially the new provider) have served me well in terms of the prevalence of contaminant sequences. These were a huge headache for me in the past, but this time their prevalence appears to be very low according to my blanks, my cell-based dilution series, and the decontam results.

I have to say, I find this part of the process, data-wrangling, the most enjoyable. Yet it is so difficult! I feel so aware of all the things that can go wrong in this field, I'm wondering how it is possible to ever feel confident about my results!

That's my cry into the void done for today. Thanks, R.

colinbrislawn · July 8, 2025, 12:19am

Same, lol

When I worked for a clinical startup, we tested extensively with positive controls with a known composition to establish a 'level of detection' and 'level of blank' for specific taxa.

I strongly recommend positiven controls.
GitHub - caporaso-lab/mockrobiota: A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.

owlpen · July 8, 2025, 9:48am

Hi @colinbrislawn ,

thanks for the tip. I used positive controls in this run. The DNA based controls worked really well in terms of expected composition, but my cell based one was terrible. Problem now is that it is difficult to know if it is the mock that is at fault here or the DNA extraction protocol (which I spent a long time optimising in the past). I hadn't realised until the other week that mocks can degrade in quality over time - mine has been sitting in a freezer since I made a batch a couple of years ago! Still I have one more chance in the next three months to try with a fresh mock in a mini pilot working with human samples.

The other issue with this cell based mock is that one of the community members, and the associated ASV. was also detected in the negative control at quite high abundance, plus in several other biological samples. This ASV is extremely over-expressed in the cell based mock and is the main cause of the skewed composition, so it is really tricky to know whether this is the impact of a contaminant sequence or just a poor quality mock! It wouldn't be out of place in the biological samples either, so I guess it means I just have to report the findings with several caveats.

What did work for me this time, was (as you suggested above) establishing the lower limit of detection - in that one of the DNA mocks had a logarithmic distribution, and for the cell based mocks I used a dilution series. This was very helpful in identifying the impact of contaminant reads (aside from the issue above, quite low in this run) and also lower limit of detection. This has resulted in my decisions of where to set my "prevalence filter" for the dataset feeling much less arbitrary. It has taken me 4 years to get to this point though! Time machine wishes again!

Thanks again for the comments - I'm enjoying it.