Issues troubleshooting sequencing quality - missing primers and ambiguous bases in reverse reads

sr23 · November 6, 2023, 4:30pm

Hello,

I have been working with a set of environmental samples that we submitted to a sequencing facility for Illumina MiSeq sequencing. We sequenced the V4 region using 515F and 806R primers. Since receiving the data, we have run into a number of odd things.

The forward and reverse primers are only present in about 72% of sequences (searched for with Biostrings), which forces us to remove a large portion of the sequences.
Several of the R2 reads have ambiguous base calls at around 22 bp, right after the reverse primer at positions 1-20. These same reads also seem to have high homoplasy. When I work through the mothur workflow to screen sequences (with the screen.seqs command), it removes about half of my sequences when max ambiguity = 0 and max homoplasy = 8.
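
For readers who want to see what those two thresholds measure, here is a minimal Python sketch of the same kind of filter. It is not mothur's implementation; the function names and toy reads are made up for illustration, and the second threshold is treated as a limit on the longest run of a single repeated base.

```python
from itertools import groupby


def count_ambiguous(seq: str) -> int:
    """Count bases that are not an unambiguous A, C, G or T (for example N)."""
    return sum(1 for base in seq.upper() if base not in "ACGT")


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty read)."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def passes_screen(seq: str, max_ambig: int = 0, max_homop: int = 8) -> bool:
    """Keep a read only if it stays within both thresholds."""
    return (count_ambiguous(seq) <= max_ambig
            and longest_homopolymer(seq) <= max_homop)


# Hypothetical toy reads, not taken from the dataset discussed here.
reads = {
    "clean": "ACGTACGTACGT",
    "one_N": "ACGTNACGTACG",
    "long_run": "ACG" + "A" * 9 + "T",
}
for name, seq in reads.items():
    print(name, count_ambiguous(seq), longest_homopolymer(seq), passes_screen(seq))
```

Running the two counts separately over the real reads would show whether the removed half is failing on ambiguous bases, on long single-base runs, or on both.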

We confirmed with the sequencing facility that the forward and reverse primers should still be present, but we are unsure how to troubleshoot why only 72% of sequences have primers. Additionally, it is unclear which step in the process caused several R2 reads to have this ambiguous base. We are thinking it might have something to do with library prep, but I have not been able to find information online that indicates what exactly causes this.

Thank you!

SoilRotifer · November 6, 2023, 7:22pm

Have you tried using cutadapt to find and remove your primer sequences from paired-end data? I find that it has a better approach to handling mismatches etc. I'd suggest using the --p-discard-untrimmed flag too.

I'd suggest allowing for ambiguity in your sequences. I think not allowing for ambiguity (i.e. "max ambiguity = 0") is a problem. Remember that the 16S rRNA V4 primer sequences contain IUPAC ambiguity codes, which means there will be variability in the primers within the resulting sequences. You also need to take any minor sequencing errors into account. Cutadapt, and the corresponding QIIME 2 plugin, can handle matching to these IUPAC ambiguity codes.
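
To make the IUPAC point concrete, here is a rough Python sketch of degenerate primer matching. It is only an illustration, not how cutadapt works internally: it checks the 5' end of the read only, counts substitutions but not indels, and treats an N in the read as a mismatch. The function names and the toy reads are hypothetical; the primer is the 515F (Parada) sequence.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer: str, read_prefix: str) -> int:
    """Count positions where the read base is not allowed by the primer code."""
    return sum(1 for code, base in zip(primer.upper(), read_prefix.upper())
               if base not in IUPAC[code])


def starts_with_primer(read: str, primer: str, max_mismatches: int = 1) -> bool:
    """True if the read begins with the primer, within the mismatch budget."""
    if len(read) < len(primer):
        return False
    return primer_mismatches(primer, read[:len(primer)]) <= max_mismatches


PRIMER_515F = "GTGYCAGCMGCCGCGGTAA"

# Hypothetical reads: the first two differ only at the degenerate Y and M
# positions, the third has two real mismatches at those positions.
print(starts_with_primer("GTGCCAGCAGCCGCGGTAATACG", PRIMER_515F))
print(starts_with_primer("GTGTCAGCCGCCGCGGTAATACG", PRIMER_515F))
print(starts_with_primer("GTGACAGCTGCCGCGGTAATACG", PRIMER_515F))
```

An exact string search for a single, non-degenerate version of the primer would miss reads like the second one, which is one way a primer-containing fraction can come out lower than expected.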

Give cutadapt a try, or at least relax your max ambiguity setting. By default cutadapt allows for an error rate of 10% for primer matching.
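
As a quick piece of arithmetic on what 10% means in practice: as I understand the cutadapt documentation, the number of allowed errors is the error rate times the length of the matched primer, rounded down. The helper name below is hypothetical and the rounding rule is my reading of the docs, so treat this as a sketch rather than a specification.

```python
import math


def allowed_errors(primer_length: int, max_error_rate: float = 0.1) -> int:
    """Assumed rule: floor(length * error rate) errors are tolerated."""
    return math.floor(primer_length * max_error_rate)


primers = {
    "515F": "GTGYCAGCMGCCGCGGTAA",
    "806R": "GGACTACNVGGGTWTCTAAT",
}
for name, seq in primers.items():
    print(name, len(seq), allowed_errors(len(seq)))
```

Under that assumption, the 19-nt forward primer tolerates one error and the 20-nt reverse primer tolerates two at the default rate.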

sr23 · November 6, 2023, 9:37pm

Hello,

Thanks for the response - I don't think this is the issue exactly. I have already used cutadapt to remove the primers before running the mothur workflow and it seems like there are still a number of ambiguous bases according to screen.seqs (besides, shouldn't the sequences themselves not be ambiguous even if the primer can be?). My larger question is about why only 72% of my sequences contain primers and what may have caused these ambiguous bases to appear? It may also be related to PCR and library prep.

SoilRotifer · November 7, 2023, 12:51am

Ahh okay, this was not mentioned in your initial post. Also, to be clear I was only referring to the mismatches / max ambiguity in the context of primer finding and removal with cutadapt, not quality control. Sorry for the confusion.

Did you set discard untrimmed with cutadapt? If you do not, you'll retain reads that still carry extra sequence spanning the primer location, and keeping that extra sequence contributes to ASV inflation. Assuming you did, this is indeed odd.

Quite strange, I'm not sure why there would be any ambiguous bases to begin with. Especially in the R2 read. Did the sequencing facility process these reads in any way, or are these raw data from the machine?

I think you mean that you have many reads with homopolymers? Normally this is an issue with 454 sequencing, not Illumina, unless the sequences themselves actually have homopolymer stretches, which I'd not expect. Or it might be a PCR or other issue as you've suggested.

This is likely not the reason, but I figure I'll ask the obvious question. Are you searching for the correct primer sequences? That is, the updated version of the V4 primers:

515F (Parada)–806R (Apprill), forward-barcoded:
FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT

I'm sure you are, just figured I'd ask.

To be clear, this is based on manual visual inspection of the raw FASTQ files themselves? Not summary output, correct?
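
If it helps to go beyond eyeballing the files, a small Python sketch like the one below could tally where the N calls fall along the raw R2 reads. It assumes plain four-line FASTQ records (optionally gzipped); the function names and the file name in the comment are hypothetical.

```python
import gzip
from collections import Counter


def read_fastq_sequences(path):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()


def ambiguous_positions(sequences):
    """Count N calls at each 1-based read position."""
    counts = Counter()
    for seq in sequences:
        for position, base in enumerate(seq.upper(), start=1):
            if base == "N":
                counts[position] += 1
    return counts


# Toy in-memory example; for real data one would pass something like
# read_fastq_sequences("sample_R2.fastq.gz") (hypothetical file name).
toy_reads = ["ACGTN", "ANGTN", "ACGTA"]
for position, n_count in sorted(ambiguous_positions(toy_reads).items()):
    print(position, n_count)
```

A sharp spike at a single position across many reads would point toward something that happened during the run at that cycle, whereas N calls scattered across positions would suggest a more general quality problem.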

I feel bad that I am unable to help you. This is certainly perplexing.
Hopefully, others have better insight than me.