question about primer trimming in qiime2 16S Miseq tutorials

Greetings Qiime2 community, :laptop: :doughnut:

I am running qiime2 amplicon 2026.1 in a conda environment.

I am new to qiime2, and learning a lot using the great tutorials to analyze mock libraries before analyzing our lab's experimental 16S data.

Currently I am analyzing a Miseq mock library. My understanding of Miseq read structure is that the PCR primers used to generate the amplicons are directly adjacent to the insert (grey), and the Illumina indices, sequencing primers, etc are distal to the insert, as shown in this image:

To analyze sequencing data, I could trim the PCR primers used to generate the amplicons (such as 515f and 806r) and that would remove the sequencing primers (orange and blue), the indices (green), and the adapters (?) in red and black.

As indicated in this q2 link, it would be important to remove non-biological sequence by trimming before analyzing reads. That makes sense to me, because taxonomy classifiers would frequently misclassify reads that include non-biological sequence.

My question is about primer trimming in the qiime2 16S tutorials. I'm afraid I'm missing something obvious, due to my inexperience with bioinformatics / q2. The trimming suggestions in the tutorials seem like they would be leaving a lot of non-biological sequence (untrimmed primers/adapters/etc).

For example, in the gut-to-soil tutorial: "16S rRNA gene was amplified using the F515-R806 primers .... Paired-end sequencing was performed on an Illumina MiSeq". The tutorial recommends using the quality scores viewed in demux.qzv to choose trimming/truncation values, and the suggested values are as below:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 250 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 250 \

This would leave a lot of non-biological sequence on the 5' end of the forward and reverse reads, if I'm understanding correctly.

I checked some of the other tutorials to see if I could learn more suggestions about primer trimming with Miseq data.

The Atacama soil tutorial also uses Miseq data, and suggests only trimming the first 13 nt on the 5' ends, which is insufficient to remove all non-biological sequence, if I'm understanding correctly.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \

The Parkinson's mouse tutorials also use 16S MiSeq data, and do not trim any 5' primer sequence. The Moving Pictures tutorial uses HiSeq data, which according to the Illumina image link above, has the same read structure as MiSeq. That tutorial also does not trim any nucleotides from the 5' end of reads.

All of the tutorials are using DADA2 during the read truncation step, and the sequences have already been demultiplexed. Is it possible that indices etc are trimmed during the demux step when raw data is imported into q2? I see that if cutadapt is used to demux, then primers are automatically removed, but it seems like none of the q2 tutorials I looked at are using cutadapt to demux. Also, the DADA2 16S tutorial indicates that data should already have primers removed, so I would assume that DADA2 is not automatically removing primers during the denoising step in the q2 tutorials either.

I could use grep to check my raw seq data before and after importing to check for primer seq etc, but it would be reassuring to have some experienced folks chime in (ha :qiime2:) so that I don't just rely on my own command line experiments to make sense of things.
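As a rough alternative to grep for that check, here is a minimal Python sketch that counts how many reads begin with a degenerate primer. The function names and the two toy records are made up for illustration, and the primer is the forward primer from the cutadapt command quoted in this post. It only finds exact matches at the very start of the read, so it is a quick sanity check rather than a replacement for cutadapt.

```python
import re

# IUPAC degenerate nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regex anchored at the read start."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer.upper()))


def fraction_with_5prime_primer(fastq_lines, primer):
    """Return the fraction of reads whose 5' end matches the primer exactly."""
    pattern = primer_regex(primer)
    total = hits = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the 2nd line of each 4-line record
            total += 1
            hits += bool(pattern.match(line.strip()))
    return hits / total if total else 0.0


# Two toy FASTQ records (hypothetical): read1 starts with the primer, read2 does not
records = [
    "@read1", "GTGCCAGCAGCCGCGGTAATACGGAGGGTGC", "+", "I" * 31,
    "@read2", "TACGGAGGGTGCAAGCGTTAATCGGAATTAC", "+", "I" * 31,
]
print(fraction_with_5prime_primer(records, "GTGYCAGCMGCCGCGGTAA"))
```

For a real file, any iterable of lines works, for example an open handle from `gzip.open(path, "rt")`. A fraction near zero would suggest the primer is not in the reads, and a fraction near one would suggest it is.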

I see lots of folks in the forum using cutadapt in qiime2 in order to trim primers in a way that makes more sense to me, for example:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-cores 4 \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r CCGYCAATTYMTTTRAGTTT \
  --p-match-adapter-wildcards \
  --p-match-read-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux_trimmed.qza

This would remove the amplicon PCR primers like 515f and 806r, and also the distal sequences mentioned above, leaving only biological sequence for downstream steps.

My questions are:

  1. Does the process of importing/demuxing of samples into q2 as shown in the tutorials (e.g. NOT demux using cutadapt) also remove common Illumina adapters and I just didn't realize it?
  2. Is that why in the tutorials, the demuxed reads are not trimmed, and primer sequence removal is not described?
  3. If the above is true, then why are folks advised to also use cutadapt to remove primers from data that has already been imported and demuxed by q2, for example here? Seems like it should be one or the other; not both.

Thanks for any info or suggestions you might have! :folded_hands:

Hi @sibilant,

In brief, it depends on the sequencing protocol used, that is, whether the sequencing step "reads through" the PCR primers or not. Several of the tutorials use a protocol that does not read through the PCR primers, hence no need for cutadapt. Other methods do read through the PCR primers, and thus cutadapt should be used.

Some may opt to trim off the primers using positional information, others use cutadapt. I prefer cutadapt, as there can be cases where an indel occurs early on in the sequence. This can cause the resulting sequence to be a base longer or shorter than it should be, potentially altering the denoising steps downstream. So, I prefer to use cutadapt for a more dynamic and clean removal of the primers. I also use cutadapt as a form of quality control, that is, if I am unable to find a primer in either of the reads, I discard the read pair.
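To illustrate the difference between positional and match-based trimming, here is a small Python sketch. It is a simplification and not how cutadapt works internally: cutadapt uses error-tolerant alignment, while this toy version only tolerates a primer that starts a base or two late. The function names, the `max_offset` parameter, and the example reads are hypothetical.

```python
import re

# IUPAC degenerate nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def trim_by_position(read, primer_length):
    """Positional trimming: drop a fixed number of bases from the 5' end."""
    return read[primer_length:]


def trim_by_match(read, primer, max_offset=2):
    """Match-based trimming: find the primer near the 5' end and cut after it.

    Returns None when the primer is not found, mimicking discard-untrimmed.
    """
    pattern = re.compile("".join(IUPAC[base] for base in primer.upper()))
    # Only accept a primer that starts within the first max_offset bases
    match = pattern.search(read, 0, max_offset + len(primer))
    return read[match.end():] if match else None


PRIMER = "GTGYCAGCMGCCGCGGTAA"
INSERT = "TACGGAGGGTGC"
clean = "GTGCCAGCAGCCGCGGTAA" + INSERT   # primer starts at base 1
shifted = "A" + clean                    # one extra base before the primer

print(trim_by_position(clean, len(PRIMER)))    # insert only
print(trim_by_position(shifted, len(PRIMER)))  # one primer base left behind
print(trim_by_match(shifted, PRIMER))          # insert only
```

With the shifted read, positional trimming leaves the last primer base attached to the insert, which is the kind of off-by-one sequence that could show up as a spurious variant during denoising.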

See this thread.

I'd also suggest using the --verbose option when running cutadapt. That output will show you how many of your reads are trimmed. If the primer pair is present, then 90%+ of your reads should be trimmed. Otherwise there will be close to no trimming, aside from spurious matches.

-Mike


Thanks very much @SoilRotifer! :dna:

I realized I must be missing some critical information, because it makes no sense to me how there can be no primer sequence at the 5' end of a raw read, regardless of read-through or not.

I thought maybe the reason MiSeq data does not include primer sequence (unless there's read-through) is because the Illumina platform only records incorporation of fluorescent nucleotides in the fastq data that are created. The primers themselves are made with untagged, non-fluorescent bases, and therefore are not represented in the corresponding fastq files (unless there's read-through).

See for example the last panel labeled 'Sequencing' in this image, from this link:

In the 'Sequencing' panel, the grey and black empty boxes representing sequencing primers are not fluorescent, and so are not recorded in the 'Signal scanning' used to generate the digital files.

Maybe because I'm primarily a wet-lab biologist, I wasn't making a distinction between the sequence of literal DNA molecules in the tube of a sequencing reaction, and the digital DNA sequence file created from those molecules.

But, if that were the case, then why for example are there so many primers removed during the DADA2 ITS tutorial, that are not primer read-through?

For example, we have this quality-control output before trimming primers:

We can see that there are 3743 and 3590 examples of expected primer read-through; I understand this is typical for variable-length ITS sequence.

However, we also see 4214 and 4200 primers that are not due to primer read-through. This is also a MiSeq dataset. These are the Forward primers in the Forward read, and Reverse primers in the Reverse read, if I'm understanding correctly. They would not be fluorescent in the DNA molecules, yet they are still represented in the digital fastq files.

I must be missing some additional piece(s) of information to explain the presence of the non-read-through primers here. Because otherwise, all the q2 tutorials are including primer sequence in downstream analysis, as shown in the DADA2 screencap above. Whereas, you were writing above that there are no primers for these sequencing protocols, because only read-through creates primer seq in the data.

My questions are:

  1. Wouldn't all MiSeq sequencing reads have forward primers in the forward reads, and reverse primers in the reverse reads, at the 5' ends, as shown in the 4214 and 4200 columns of the DADA2 screencap above?

  2. If so, wouldn't these primer sequences need to be trimmed before analysis?

Hopefully the apparent contradiction I'm trying to resolve is clear:
either the q2 tutorials are correct in that no 5' end primer trimming is needed, because the primer seq doesn't exist in the digital files (not fluorescent?). Or, the DADA2 ITS tutorial is correct, that these 5' end primers do exist in all sequencing reads regardless of read-through, and should be removed. They can't both be correct, is my understanding.

Maybe the critical piece(s) of info I'm missing are related to this comment from one of your previous helpful posts:

This is assuming the sequencing protocol used sequences through the primer, i.e. is present in the 5' region of your reads.

My understanding is that the structure of un-merged reads with NO read-through would be as follows:

Forward reads:
5' -- Forward seq-primer -- DNA insert --- some noisy N's; no R primer seq -- 3'

Reverse reads:
5' -- Reverse seq-primer -- DNA insert -- some noisy N's, no F primer seq -- 3'

Structure of un-merged reads that DO have primer read-through:

Forward reads:
5' -- Forward seq-primer -- DNA insert --- RevComp of Reverse seq-primer -- 3'

Reverse reads:
5' -- Reverse seq-primer -- DNA insert -- RevComp of Forward seq-primer -- 3'

So, wouldn't read-through be at the 3' end of reads (the RevComp), not the 5' end, like your quote above?
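To make the two read layouts above concrete, here is a small Python sketch that looks for the reverse complement of the reverse primer inside a forward read. The helper names and the toy reads are hypothetical; the primer sequences are the ones from the cutadapt command quoted earlier in the thread. It uses exact regex matching only, so it is an illustration of the geometry and not a trimming tool.

```python
import re

# IUPAC degenerate nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_primer(read, primer):
    """Return the start index of the first primer match in the read, or -1."""
    pattern = re.compile("".join(IUPAC[base] for base in primer.upper()))
    match = pattern.search(read)
    return match.start() if match else -1


REV_PRIMER = "CCGYCAATTYMTTTRAGTTT"
fwd_concrete = "GTGCCAGCAGCCGCGGTAA"    # one concrete version of the forward primer
rev_concrete = "CCGTCAATTCATTTGAGTTT"   # one concrete version of the reverse primer
insert = "ACGTACGTAC"                   # toy insert, much shorter than a real one

# Forward read WITH read-through: primer, insert, then RevComp of the reverse primer
read_through = fwd_concrete + insert + reverse_complement(rev_concrete)
# Forward read WITHOUT read-through: the read ends inside the insert
no_read_through = fwd_concrete + insert

print(find_primer(read_through, reverse_complement(REV_PRIMER)))
print(find_primer(no_read_through, reverse_complement(REV_PRIMER)))
```

In this toy model the reverse-complemented reverse primer is found right after the insert in the read-through case and not at all otherwise, which matches the layouts drawn above: read-through shows up at the 3' end.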

Thanks again for any suggestions or info! :teapot:

Hi @sibilant,

I think there are a few points of confusion. If you read the EMP documentation and associated publications, you'll see that the Illumina sequencing primers, in this case, were specifically modified to be complementary to the V4 primer sequence of the amplified DNA. Thus the primer sequence itself is never sequenced, as the first sequenced base is the first base after the primer sequence, and so it will not be in the FASTQ output. This is why you can sequence the V4 amplicon with the 2x150 kit.

Most people use generic approaches, which sequence through the PCR primer, because the generic Illumina sequencing primer is used.
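A toy model may help picture the difference between the two setups. This Python sketch is only an illustration: it ignores strandedness and real chemistry, the adapter string is a made-up placeholder, and `simulate_read` is a hypothetical helper. The only idea it encodes is that the read begins at the first base after the site where the sequencing primer anneals.

```python
def simulate_read(template, seq_primer_site, read_length):
    """Return the bases read from the template, starting right after the
    site where the sequencing primer anneals (toy model, single strand)."""
    start = template.index(seq_primer_site) + len(seq_primer_site)
    return template[start:start + read_length]


ADAPTER = "CATCATCATCAT"             # made-up placeholder, not a real Illumina adapter
PCR_PRIMER = "GTGCCAGCAGCCGCGGTAA"   # one concrete version of the forward PCR primer
INSERT = "TACGGAGGGTGCAAGC"          # toy biological insert

template = ADAPTER + PCR_PRIMER + INSERT

# Generic setup: the sequencing primer anneals to the adapter,
# so the first bases of the read are the PCR primer.
generic_read = simulate_read(template, ADAPTER, len(PCR_PRIMER))

# EMP-style setup: the sequencing primer site extends across the PCR primer,
# so the first base of the read is the first base of the insert.
emp_read = simulate_read(template, ADAPTER[-4:] + PCR_PRIMER, len(INSERT))

print(generic_read)
print(emp_read)
```

In the generic case the read starts with the PCR primer, so it needs trimming. In the EMP-style case the read starts in the insert, so there is nothing to trim at the 5' end.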

The other aspect of this is read-through, which normally implies that the amplicon is very short and you are reading all the way through into the reverse complement of the primer/adapter at the 3' end. This is why you may detect the primers on the 3' end but not the 5' end. This happens quite often with ITS and other short, variable-length genes.

Often I will run cutadapt twice in this case: once with the discard-untrimmed option when removing the 5' primers, then again to remove the reverse complement of the 3' primers without discard-untrimmed, as not all reads will read through into the reverse complement of the opposite primer.
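The logic of that two-pass approach can be sketched in Python as follows. This is an illustration of the decision flow only, not cutadapt itself: it works on a single read rather than a read pair, uses exact regex matching instead of error-tolerant alignment, and the function names and toy reads are hypothetical.

```python
import re

# IUPAC degenerate nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def to_regex(primer):
    """Translate a degenerate primer into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def two_pass_trim(read, five_prime_primer, other_primer):
    """Pass 1: require and remove the 5' primer; return None if it is absent.
    Pass 2: cut at the reverse complement of the other primer if present,
    but keep the read unchanged when there is no read-through."""
    first = re.match(to_regex(five_prime_primer), read)
    if first is None:
        return None  # discard-untrimmed behaviour for the 5' primer
    trimmed = read[first.end():]
    second = re.search(to_regex(reverse_complement(other_primer)), trimmed)
    return trimmed[:second.start()] if second else trimmed


FWD = "GTGYCAGCMGCCGCGGTAA"
REV = "CCGYCAATTYMTTTRAGTTT"
fwd_concrete = "GTGCCAGCAGCCGCGGTAA"
rev_concrete = "CCGTCAATTCATTTGAGTTT"
insert = "ACGTACGTAC"  # toy insert

# Read-through read with a few made-up trailing bases standing in for adapter
read_through = fwd_concrete + insert + reverse_complement(rev_concrete) + "AGATCG"
no_read_through = fwd_concrete + insert

print(two_pass_trim(read_through, FWD, REV))
print(two_pass_trim(no_read_through, FWD, REV))
print(two_pass_trim(insert, FWD, REV))
```

Both reads that carry the 5' primer come back as the bare insert, while the read without a 5' primer is dropped. That is the reason for treating the two passes differently: a missing 5' primer is a quality signal, while a missing 3' primer just means the read did not run off the end of the amplicon.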

-Mike

Thanks again!

I understand about the sequencing primers annealing to the PCR primer sequence; it's similar to what I was saying about the sequencing primers not being made from fluorescent nucleotides, and therefore not represented in the fastq file.

Most people use generic approaches, which sequence through the PCR primer, because the generic Illumina sequencing primer is used.

Ok I think it's coming together: are you saying that in the case of the three q2 tutorials I mentioned above, they are using the EMP sequencing primers, whose ends line up exactly with the ends of the PCR primers, and therefore with this combo of sequencing primers and PCR primers, there's no primer seq at the 5' ends of untrimmed reads?

Whereas with the 'generic' Illumina sequencing primers, it anneals distally to the PCR primers, and therefore the PCR primer is the 5' sequence of all raw reads? An example would be the DADA2 tutorial, which in the screencap above shows the non-read-through primers at 5' ends of For and Rev reads.

I appreciate your patience and help as I try to sort this out!

I guess I am confused as to what you mean by "non-read-through primer". Given the tutorial you link to, these are standard Illumina sequencing outputs where the primers are read through and are present in the reads. This is even explicitly expected in the tutorial:

As expected, the FWD primer is found in the forward reads in its forward orientation, and in some of the reverse reads in its reverse-complement orientation (due to read-through when the ITS region is short). Similarly the REV primer is found with its expected orientations.

That tutorial also uses cutadapt to remove them. Which is what we can do in QIIME 2 also. So, I am not entirely sure what is being asked here.

Also, keep in mind that the specific primers being used have nothing to do with the sequencing protocol used. You can use the same V4 primers with either the EMP sequencing protocol, which does not read through the primers, or a generic protocol that does read through the primers. There are myriad sequencing protocols, and many have nothing to do with the primers selected.

Also, I think you are confusing two different sets of workflows. The DADA2 / R specific tutorials, and the QIIME 2 tutorials. There are not always going to be complete parity / overlap on "how to do things" across different tools. For example, not all functionality from DADA2 is necessarily available in QIIME 2.

Aw shucks it looks like you replied to a previous draft of my q. I didn't edit fast enough. Sorry about that!

Here's my current understanding below; I think I've got it now. Could you let me know?

Thanks again!

I understand about the sequencing primers annealing to the PCR primer sequence; it's similar to what I was saying about the sequencing primers not being made from fluorescent nucleotides, and therefore not represented in the fastq file.

Most people use generic approaches, which sequence through the PCR primer, because the generic Illumina sequencing primer is used.

Ok I think it's coming together: are you saying that in the case of the three q2 tutorials I mentioned above, they are using the EMP sequencing primers, whose ends line up exactly with the ends of the PCR primers, and therefore with this combo of sequencing primers and PCR primers, there's no primer seq at the 5' ends of untrimmed reads?

Whereas with the 'generic' Illumina sequencing primers, it anneals distally to the PCR primers, and therefore the PCR primer is the 5' sequence of all raw reads? An example would be the DADA2 tutorial, which in the screencap above shows the non-read-through primers at 5' ends of For and Rev reads.

I appreciate your patience and help as I try to sort this out!

:+1:

Looks like you got it! :+1:

That's why we're here. :slight_smile:

Ahh, finally! :raising_hands: Thank you so much!

It would have taken me an age to get to the bottom of the "missing primer trimming step" in the q2 tutorials without your help today. I hadn't heard about the special EMP primer design; that's a cool feature. I really appreciate it! :seedling: :orange_heart:
