Questions about 16S data from Novogene (UK)

Hello @KQUB,

So, clearly bits of both adapters are in the forward reads and in the reverse reads. I guess I should try running cutadapt on forward and reverse reads with both of the "adapter trimming sequences" ... ... each time, then re-do the count and see if they've been successfully removed.

Cutadapt will trim both read directions at the same time. The trim-paired action will allow up to four sequences to be searched for: for both read directions, a sequence at the 5' end and a sequence at the 3' end. You should confirm this, but I think that cutadpt will look for the reverse complement of these sequences as well. You can pass multiple sequences to each of these parameters at once, but remember to use the --n-times parameter and to read the cutadapt docs because it gets confusing.

If I do this with cutadapt , all parts of both adapters will be removed from all reads simultaneously, right? So, I don't have to then look for and remove barcodes separately?

Since the barcodes are downstream of the adapter trimming sequences, all of the barcodes should be removed as well afterwards.

Dada2 will remove the chimeras, yes. Best of luck with the rest of your analysis!

1 Like

Hi again, @colinvwood,

Some more questions (hopefully among the last)!

I think I have two clear (logical) objectives for the cutadapt step:

  1. Remove all reads with Ns (i.e. unknown nucleotides). I've seen that some of my reads have Ns, and I know that DADA2 can't handle Ns, so those reads should be removed.
  2. Remove the adapters we might logically expect to see in some of the reads as a result of sequencing through to the other side of a short DNA insert (i.e. removing the P7 adapter from forward reads, and the P5 adapter from reverse reads; or, according to this figure, removing everything downstream of the DNA insert from the forward reads, and removing everything upstream of the DNA insert from the reverse reads):
    image

Given that the V3–V4 primers appear to have already been removed from the ends of my "raw" sequences, one might imagine that fulfilling the two above objectives would be enough for me to then move on to denoising with DADA2; because having done those, I would have reads that are free from the V3–V4 primers, free from the adapters, and free from Ns. But although the bit about removing Ns will be true, the rest won't, actually.

I have seen that some of my forward reads (illogically, I think) contain bits of the P5 adapter, and some of my reverse reads contain bits of the P7 adapter (see my previous post in this conversation). In addition, some of my reads contain the V3–V4 primers in the middle of them (see my first post in this conversation). Those two kinds of reads don't make sense to me, based on my current understanding of Illumina sequencing. Would you agree that those kinds of reads don't follow expected behaviour? If so, are they most likely the result of artefacts arising from the library preparation or sequencing processes? And lastly, is there any way I can just dump these reads, or salvage some useful info from them, or how should I deal with them?

Thanks, as always, for your time! :blush:

Hello @KQUB,

  1. Remove all reads with Ns (i.e. unknown nucleotides). I've seen that some of my reads have Ns, and I know that DADA2 can't handle Ns, so those reads should be removed.

Dada2 takes care of this for you, no need to do this yourself.

I have seen that some of my forward reads (illogically, I think) contain bits of the P5 adapter, and some of my reverse reads contain bits of the P7 adapter (see my previous post in this conversation). In addition, some of my reads contain the V3–V4 primers in the middle of them (see my first post in this conversation).

The simplest thing to do is to search for both adapter prefix sequences and both primers in both of your read directions.

Those two kinds of reads don't make sense to me, based on my current understanding of Illumina sequencing. Would you agree that those kinds of reads don't follow expected behaviour?

Regarding the opposite adapter situation, yes. I wouldn't necessarily have been surprised by some exceptions, but I wouldn't have expected them in the frequencies that you shared.

Regarding amplicon primers in the middle of sequences, that is not that uncommon (primer dimers, chimeras, non-specific amplification, short amplicons, ...).

If so, are they most likely the result of artifacts arising from the library preparation or sequencing processes?

Again, regarding the adapter issue, I think so, unless there is some lack of understanding on my part.

And lastly, is there any way I can just dump these reads, or salvage some useful info from them, or how should I deal with them?

Letting the trimming tools do their thing will amount to salvaging as much as is possible. One could make an argument for discarding these types of reads completely however, because as you say, there is reason to believe they are artifacts of library prep, PCR, or sequencing. That is a subjective decision I suppose.

I wouldn't discard reads that contain amplicon primers in the middle of the sequences because there are 'natural' ways this can occur.

1 Like

Hello @colinvwood,

Back again with some more questions.

Okay, is this something specific to q2-dada2? Because the tutorial on the page for the DADA2 R package says "DADA2 requires no Ns" (see the 'Filter and trim' section here).

This sounds like a good approach!

Are there? How so? (Sorry if you've already mentioned this somewhere above.)

Lastly, can you recommend some reading material for me to help me get my head around all of this stuff? Where did you get all your knowledge about Illumina sequencing, sequencing artefacts etc.? For example, I'd love to know more about how things like primer dimers, chimeras, and non-specific amplification occur and how common they are. Do you know of any resources for that?

Thanks as always for your help! :blush:

Hello @KQUB,

Okay, is this something specific to q2-dada2 ? Because the tutorial on the page for the DADA2 R package says "DADA2 requires no Ns" (see the 'Filter and trim' section here ).

We wrap the same filterAndTrim function that's in that tutorial so it's done for you.

I wouldn't discard reads that contain amplicon primers in the middle of the sequences because there are 'natural' ways this can occur.

Are there? How so? (Sorry if you've already mentioned this somewhere above.)

All I was getting at here is that there is no reason to assume that the sequence upstream of a primer isn't true biological sequence. People generally conserve as much of a read as possible, and don't discard an entire read because it contains a non-biological subsequence.

Lastly, can you recommend some reading material for me to help me get my head around all of this stuff? Where did you get all your knowledge about Illumina sequencing, sequencing artefacts etc.? For example, I'd love to know more about how things like primer dimers, chimeras, and non-specific amplification occur and how common they are. Do you know of any resources for that?

This is one of those areas that to my knowledge doesn't have any good centralized source of information. The illumina knowledge website that you linked above is pretty useful. I've just sort of pieced things together from reading library prep protocols, working in a wet lab, this forum, the cutadapt docs and other scraps around the internet. Case in point, I learned from this discussion that adapters and adapter trimming sequences are not the same thing.

1 Like

Hi @colinvwood,

So, here is a summary of my observations so far:

  • Some of my F and R reads contain Ns (i.e. unknown nucleotides). (Doesn’t matter, q2-dada2 will remove these reads automatically – at least, that's my current understanding.)
  • Some of my F reads contain the V3–V4 F primer; not right at the 5' end of the reads, but nearer the 5' end than the 3' end. (Should be removed using qiime cutadapt trim-paired.)
  • Some of my R reads contain the V3–V4 R primer; not right at the 5' end of the reads, but nearer the 5' end than the 3' end. (Should be removed using qiime cutadapt trim-paired.)
  • Some of my F reads contain the read 1 adapter trimming sequence towards the 3’ end (not entirely unexpected); and some have the read 2 adapter trimming sequence towards the 3’ end (unexpected). (Should be removed using qiime cutadapt trim-paired.)
  • Some of my R reads contain the read 2 adapter trimming sequence towards the 3’ end (not entirely unexpected); and some have the read 1 adapter trimming sequence towards the 3’ end (unexpected). (Should be removed using qiime cutadapt trim-paired.)

Adapter removal

Since the read 1 and read 2 adapters can be found towards the 3' end of some F and R reads, I assume I should use the --p-adapter-f and --p-adapter-r parameters of qiime cutadapt trim-paired to remove them both from F and R reads (removing also all downstream bases).

Adapter trimming sequences:

TruSeq™ single index (previously LT) and TruSeq CD index (previously HT)-based kits:

  • Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
  • Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Removing adapters with qiime cutadapt trim-paired:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-adapter-r AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT  \
  --verbose > adapter_trim_output.txt \
  --o-trimmed-sequences demux_adapters_trimmed.qza

V3–V4 primer removal

As for the V3–V4 primers, since they are nearer the 5' end rather than the 3' end, I was thinking of using the --p-front-f and --p-front-r parameters of qiime cutadapt trim-paired to remove them (and all upstream bases).

Something like:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux_adapters_trimmed.qza \
  --p-front-f CCTAYGGGRBGCASCAG \
  --p-front-r GGACTACNNGGGTATCTAAT  \
  --verbose > primer_trim_output.txt \
  --o-trimmed-sequences demux_adapters_and_primers_trimmed.qza

But then I read this comment:

Since the V3–V4 F primer is present in some of my F reads near the 5' end, but not exactly at the 5' end, and the V3–V4 R primer is present in some of my R reads near the 5' end, but not exactly at the 5' end, I would have thought it best to remove the primers and all upstream sequence (assuming that the sequence downstream of the primer might be biological sequence from the 16S rRNA gene). Is this wrong? What's your understanding here? You're saying keep the sequence upstream of the primers by using --p-adapter-f and --p-adapter-r to remove the primers?

Quality trimming at the 3' end

I was thinking of also doing some quality trimming at the 3' ends of the reads.

Here are my raw (binned) quality plots [NovaSeq 6000 data]:

Our data seem to have only four possible Q-scores (2, 11, 25, 37):

  • 2 = #
  • 11 = ,
  • 25 = :
  • 37 = F

Would it be good to add a --p-quality-cutoff-3end 3 parameter to the qiime cutadapt trim-paired command to trim all nucleotides with Q-score of 2 from the 3' ends of the reads? Or should I even remove the nucleotides with Q-score 11 too, by using --p-quality-cutoff-3end 12?

I guess I could instead do this at the qiime dada2 denoise-paired step, using the --p-trunc-q parameter.

My attempt at a single qiime cutadapt trim-paired command to do all of this simultaneously

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTAYGGGRBGCASCAG \
  --p-front-r GGACTACNNGGGTATCTAAT \
  --p-adapter-f AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-adapter-r AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-quality-cutoff-3end 12 \
  --verbose > cutadapt_output.txt \
  --o-trimmed-sequences demux_adapters_and_primers_trimmed.qza

Am I on the right track? You mentioned an --n-times parameter earlier in this thread (I suppose you mean --p-times). Do I need to include that here? And if so, why?

Dealing with short reads

I also thought about adding a --p-minimum-length parameter (maybe --p-minimum-length 180) to remove very short reads. I seem to have a small number of very short reads (shortest F read length = 24 nt, shortest R read length = 31 nt) in my data. Is it worth removing those, or will q2-dada2 deal with them?

Thanks in advance for any advice you can give, @colinvwood! I am learning a lot here. I think you have earned an acknowledgment on my thesis at this point.

Hello @KQUB,

Since the read 1 and read 2 adapters can be found towards the 3' end of some F and R reads, I assume I should use the --p-adapter-f and --p-adapter-r parameters of qiime cutadapt trim-paired to remove them both from F and R reads (removing also all downstream bases).

Be careful when passing multiple trimming sequences to cutadapt at once, see this post for an example of why. In this case it shouldn't really matter because it would be exceedingly strange to see two adapters in a single read, but I would probably still do them separately or use the --p-times option.

Since the V3–V4 F primer is present in some of my F reads near the 5' end, but not exactly at the 5' end, and the V3–V4 R primer is present in some of my R reads near the 5' end, but not exactly at the 5' end, I would have thought it best to remove the primers and all upstream sequence (assuming that the sequence downstream of the primer might be biological sequence from the 16S rRNA gene).

Yes you are right. The upstream comment I made was referring to the adapters, which you only expect to see at the 3' end. At the 5' end you remove the preceding sequence not the subsequent sequence as the cutadapt help text says.

By the way I would run the primer trimming before the adapter trimming, not the other way around. This is because you expect the primers to be nested within the adapters (see the diagrams of the library fragments above), so trimming primers will in most cases also trim the adapters.

I was thinking of also doing some quality trimming at the 3' ends of the reads.

Those sequences do not look like they need any quality filtering. Even if the quality were lower towards the end of the sequences, you could parameterize the trimming positions to dada2, not dedicate an entire extra step to only quality filtering.

This is possible, but dada2 takes quality scores into account in its algorithm, so this would mostly be personal preference.

I also thought about adding a --p-minimum-length parameter (maybe --p-minimum-length 180 ) to remove very short reads. I seem to have a small number of very short reads (shortest F read length = 24 nt, shortest R read length = 31 nt) in my data. Is it worth removing those, or will q2-dada2 deal with them?

Dada2 will deal them by discarding all reads shorter than the truncation length you provide for each read direction.

2 Likes

Hi, @colinvwood,

Good advice, it seems. I also found some other forum posts where the solution was to run two separate cutadapt commands: here, here and here. I will probably do the same, then.

I'm not sure I follow. I have been talking about removing the V3–V4 primers (a.k.a. the amplicon primers, the PCR primers) from near (but not at) the 5' ends of my reads. As far as I understand, these primers are not parts of the adapters. They should instead be part of the DNA insert shown below. The sequencing primers (or sequencing primer binding sites; i.e. Rd1 SP and Rd2 SP) are parts of the adapters, but the V3–V4 primers aren't, right? My understanding is that it therefore doesn't matter whether the adapters or V3–V4 primers are trimmed first.

image

Is that true? What makes you say so? I've read that quality plots for NovaSeq data tend to look different to, for example, quality plots for MiSeq data because of the binning of quality scores (a.k.a. Q-scores): other examples here, here, here, and here. In other words, it's my understanding that the Q-score for a given base is an average of many grouped bases. Therefore, the "actual" Q-score for a given base could be lower than expected. I'm no expert on this, though.

Whatever the case, I guess we can only work with the Q-scores we have. Here are some quotes from an Illumina PDF about NovaSeq™ 6000 System Quality Scores:

"A Q-score of 30 (Q30) corresponds to a 0.1 percent error rate in base calling, and is widely considered a benchmark for high-quality data."

"The three groups in the quality table correspond to marginal (<Q15), medium (~Q20), and high-quality (>Q30) base calls, and are assigned the specific scores of 12, 23, and 37 respectively.* Additionally, a null score of 2 is assigned to any no-calls."

The Q-scores for our data (i.e. 2, 11, 25, 37) are slightly different than those described above (i.e. 2, 12, 23, 37), but the idea seems the same. Would it not make sense to trim all bases with Q-scores < 25 from the 3' ends of our reads, so that we are left with only medium and high-quality base calls at the 3' end?

Or are my quality-filtering ideas a waste of time? Because...

I know I could use the truncation function of qiime dada2 denoise-paired to trim bases from the 3' end of reads instead of doing any kind of quality-based filtering, but it seems like it could be trickier to chose the truncation values with NovaSeq data than with, for example, MiSeq data, because the quality plots from demux.qzv don't seem to give as clear of a indication as to where would be a good position to trim. Or what do you think?

Thanks as always for your time, @colinvwood! :blush:

Hello @KQUB,

I'm not sure I follow. I have been talking about removing the V3–V4 primers (a.k.a. the amplicon primers, the PCR primers) from near (but not at) the 5' ends of my reads. As far as I understand, these primers are not parts of the adapters. They should instead be part of the DNA insert shown below. The sequencing primers (or sequencing primer binding sites; i.e. Rd1 SP and Rd2 SP) are parts of the adapters, but the V3–V4 primers aren't, right? My understanding is that it therefore doesn't matter whether the adapters or V3–V4 primers are trimmed first.

This is all correct. When a read reads into an adapter one would generally expect that to look like this:

5' [16S gene] [amplicon primer] [adapter] 3'

So trimming primers first will remove the primer and the adapter in these cases. Trimming the adapter first will leave the primer. Ultimately the order probably wouldn't make a difference but it seems smarter to trim the primers first for this reason.

Is that true? What makes you say so? I've read that quality plots for NovaSeq data tend to look different to, for example, quality plots for MiSeq data because of the binning of quality scores (a.k.a. Q-scores) [...] In other words, it's my understanding that the Q-score for a given base is an average of many grouped bases. Therefore, the "actual" Q-score for a given base could be lower than expected. I'm no expert on this, though.

That is all my understanding too, however the 25th percentile of these averages never drops below 25 which is really not bad at all. You can look at other posts on this forum and see that people typically see much worse quality drop off towards the end of reads, especially in the reverse reads.

But the quality scores themselves aren't really the point. The more information provided to dada2 the more accurately it can perform its denoising algorithm. Even when people have very poor scores towards the end of their reads, the entire reads are passed to dada2, along with the desired truncation length.

I know I could use the truncation function of qiime dada2 denoise-paired to trim bases from the 3' end of reads instead of doing any kind of quality-based filtering, but it seems like it could be trickier to chose the truncation values with NovaSeq data than with, for example, MiSeq data, because the quality plots from demux.qzv don't seem to give as clear of a indication as to where would be a good position to trim. Or what do you think?

To me, even taking into account their binned nature, the quality scores don't indicate the need to truncate. You'll have to pay attention to your read length distribution going into dada2 however, because of the primer and adapter trimming steps.

Although it's good to think through this stuff, even better is to experiment with different parameters and see what gives you the best results.

2 Likes

Hi @colinvwood,

True, but I'm talking about something else. (Apologies if I wasn't clear.)

I'm talking about this situation, which I observed in some reads:

  • 5' [something] [V3–V4 forward primer] [16S gene fragment] 3' in some forward reads
  • 5' [something] [V3–V4 reverse primer] [16S gene fragment] 3' in some reverse reads

Examples (from the first post in this discussion):

I searched the sequences downstream of the primers with BLAST and got hits to 16S rRNA genes in each case. Makes sense.

The sequences upstream of the primers, I'm less sure about. Here are the BLAST results for them. (I got the same results with megablast and blastn):

F reads:

  • 100% identity to several 16S rRNA genes
  • No significant similarity found.

R reads:

  • No significant similarity found.
  • No significant similarity found.

Not clear to me what's going on here (maybe a few different things). Either way, I plan to remove these upstream fragments by trimming the V3–V4 F primer (and everything upstream) from the forward reads, and the V3–V4 R primer (and everything upstream) from the reverse reads, using qiime cutadapt trim-paired with --p-front-f CCTAYGGGRBGCASCAG and --p-front-r GGACTACNNGGGTATCTAAT.

Okay, I'll try DADA2 with no truncation then.

To ensure there's enough overlap for merging, do you mean?

Got it! I'll try some things now and see what happens.

Thanks for your help (and your patience with all my questions)!

Hello @KQUB,

Not clear to me what's going on here (maybe a few different things). Either way, I plan to remove these upstream fragments by trimming the V3–V4 F primer (and everything upstream) from the forward reads, and the V3–V4 R primer (and everything upstream) from the reverse reads, using qiime cutadapt trim-paired with --p-front-f CCTAYGGGRBGCASCAG and --p-front-r GGACTACNNGGGTATCTAAT

I'm also not sure what's going on there. Strange that some of the upstream sequences are 16S and some are non biological. Did you also check those upstream sequences against your adapter sequences?

To ensure there's enough overlap for merging, do you mean?

That and, if you do end up using a truncation value because it gives you better results, to be aware that all reads shorter than your truncation length are discarded. To know how many to expect to be discarded you have to look at the length distribution.

2 Likes

Hi @colinvwood,

Okay, so yesterday I used grep to do a pretty comprehensive search for the V3–V4 primers and the two adapter trimming sequences in my forward and reverse reads. I searched for each primer and adapter trimming sequence separately, and in every paired combination of "primer plus adapter trimming sequence".

V3–V4 primers:

  • F primer: CCTAYGGGRBGCASCAG
  • R primer: GGACTACNNGGGTATCTAAT

Adapter trimming sequences:

  • Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
  • Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Here's a summary of the results:

Separate searches:

  • F primer in F reads:
    — 18376 total hits across 17414 reads
    — 776 reads containing two or more copies of the F primer
    — 106 reads containing three or more copies of the F primer
    — 36 reads containing four or more copies of the F primer
    — 20 reads containing five or more copies of the F primer
    — 15 reads containing six or more copies of the F primer
    — 4 reads containing seven or more copies of the F primer
    — 3 reads containing eight or more copies of the F primer
    — 2 reads containing nine copies of the F primer

  • R primer in F reads:
    — 7 total hits across 7 reads (those 7 reads each contain one copy of the R primer)

  • F primer in R reads:
    — 11 total hits across 10 reads (1 read had two copies of the F primer)

  • R primer in R reads:
    — 11449 total hits across 9969 reads
    — 834 reads containing two or more copies of the R primer
    — 360 reads containing three or more copies of the R primer
    — 169 reads containing four or more copies of the R primer
    — 77 reads containing five or more copies of the R primer
    — 27 reads containing six or more copies of the R primer
    — 9 reads containing seven or more copies of the R primer
    — 4 reads containing eight copies of the R primer

  • Read 1 adapter trimming sequence in F reads:
    — 136 total hits across 136 reads (those 136 reads each contain one copy of the Read 1 adapter trimming sequence)

  • Read 2 adapter trimming sequence in F reads:
    — 109 total hits across 109 reads (those 109 reads each contain one copy of the Read 2 adapter trimming sequence)

  • Read 1 adapter trimming sequence in R reads:
    — 33 total hits across 33 reads (those 33 reads each contain one copy of the Read 1 adapter trimming sequence)

  • Read 2 adapter trimming sequence in R reads:
    — 31 total hits across 31 reads (those 31 reads each contain one copy of the Read 2 adapter trimming sequence)

Combined searches:

  • F primer and Read 1 adapter trimming sequence in F reads:
    — 16 reads contain at least one copy of both
    — Read 1 adapter trimming sequence always present in only one copy, whereas multiple copies of F primer sometimes present (max: 4 in one read)
    — F primer sequence(s) always upstream of the Read 1 adapter trimming sequence

  • F primer and Read 2 adapter trimming sequence in F reads:
    — 12 reads contain at least one copy of both
    — Read 2 adapter trimming sequence always present in only one copy, whereas multiple copies of F primer sometimes present (max: 3 in one read)
    — F primer sequence(s) always upstream of the Read 2 adapter trimming sequence

  • R primer and Read 1 adapter trimming sequence in F reads:
    — no hits

  • R primer and Read 2 adapter trimming sequence in F reads:
    — no hits

  • F primer and Read 1 adapter trimming sequence in R reads:
    — no hits

  • F primer and Read 2 adapter trimming sequence in R reads:
    — no hits

  • R primer and Read 1 adapter trimming sequence in R reads:
    — 1 read contains one copy of both
    — R primer upstream of the Read 1 adapter trimming sequence

  • R primer and Read 2 adapter trimming sequence in R reads:
    — no hits

Now, to answer your question...

Answer: In my reads, no adapter trimming sequences are ever found upstream of either one of the V3–V4 primers.

So, how to proceed...

If both the F reads and R reads contain hits to the F primer, R primer, Read 1 adapter trimming sequence, and Read 2 adapter trimming sequence, I guess I need to remove all of these sequences from both F reads and R reads.

Given that the V3–V4 primers are sometimes found in multiple copies (max: 9), I should probably use --p-times 10 in my qiime cutadapt trim-paired commands, to ensure that all copies get removed.

Adapter trimming sequences only ever appear as a single copy, so I guess I don't need to use --p-times on them.

For now, I'm thinking of this approach:

Removing the Read 1 adapter trimming sequence from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  --p-adapter-r AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  --verbose > cutadapt_output1.txt \
  --o-trimmed-sequences demux2.qza

qiime demux summarize \
  --i-data demux2.qza \
  --o-visualization demux2.qzv

Removing the Read 2 adapter trimming sequence from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux2.qza \
  --p-adapter-f AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-adapter-r AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --verbose > cutadapt_output2.txt \
  --o-trimmed-sequences demux3.qza

qiime demux summarize \
  --i-data demux3.qza \
  --o-visualization demux3.qzv

Removing the F primer from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux3.qza \
  --p-front-f CCTAYGGGRBGCASCAG \
  --p-front-r CCTAYGGGRBGCASCAG \
  --p-times 10 \
  --verbose > cutadapt_output3.txt \
  --o-trimmed-sequences demux4.qza

qiime demux summarize \
  --i-data demux4.qza \
  --o-visualization demux4.qzv

Removing the R primer from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux4.qza \
  --p-front-f GGACTACNNGGGTATCTAAT \
  --p-front-r GGACTACNNGGGTATCTAAT \
  --p-times 10 \
  --verbose > cutadapt_output4.txt \
  --o-trimmed-sequences demux5.qza

qiime demux summarize \
  --i-data demux5.qza \
  --o-visualization demux5.qzv

Does this seem reasonable to you, @colinvwood?

Also, two other questions:

  • Have you ever encountered a situation like this before, where some of the F reads and R reads contain hits to the F primer, R primer, Read 1 adapter trimming sequence, and Read 2 adapter trimming sequence? I'm surprised that all of these sequences are found in both F and R reads.
  • Do you know why some of my reads have multiple internal copies of the F or R primer in them? I mean, obviously it's an artefact of some sort. Is it common to find multiple internal copies of amplicon primers in 16S reads?

Thanks, as always, for the help!

EDIT:

I just also searched for the reverse complements of the V3–V4 primers, and the reverse complements of the two adapter trimming sequences:

  • Reverse-complemented F primer: CTGSTGCVYCCCRTAGG

  • Reverse-complemented R primer: ATTAGATACCCNNGTAGTCC

  • Reverse-complemented Read 1 adapter trimming sequence: TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

  • Reverse-complemented Read 2 adapter trimming sequence: ACACTCTTTCCCTACACGACGCTCTTCCGATCT

No hits anywhere for either of the reverse-complemented adapter trimming sequences.

No hits for the reverse-complemented F primer in the R reads.
No hits for the reverse-complemented R primer in the F reads.

But...

Reverse-complemented F primer detected in F reads (8 hits across 7 reads; one read contains two copies).
Reverse-complemented R primer detected in R reads (7 hits across 6 reads; one read contains two copies).

I suppose these reverse-complemented primers should be removed as well? Another cutadapt step needed?

Hello @KQUB,

Some of these statistics are strange. I would have to see the actual data to be able to give you further advice, not sure if you're comfortable sharing a subset of it or something. I can't promise I would be able to get back to you quickly.

One thing to note about your numbers is that they are exact matches only, because you used grep (by the way, did you use a regular expression to accommodate the degenerate bases?). Accounting for mismatches will raise all these values.

1 Like

Hi @colinvwood,

Thanks so much for your continued help!

I've sent you a subset of the data via private message.

I did use regular expression to accommodate the degenerate bases in the V3–V4 primers (the adapter trimming sequences don't have degenerate bases).

Example (searching for the F primer: CCTAYGGGRBGCASCAG):

grep -E "CCTA[CT]GGG[GA][GTC]GCA[GC]CAG"

I agree that accounting for mismatches will generate more hits, but I'm not sure how to do that.

Do you have a go-to way of screening your own 16S reads for primers and adapters? If so, I'd love to know what your process is.

Thanks again!

Hello @KQUB,

I haven't ever really screened for primers or adapters, rather just removed them and then compared the before and after. A google search tells me there's agrep (approximate grep) for this. Or, if you're comfortable with a programming language it wouldn't be too hard to come up with something. FastQC can also do this stuff I believe.

I'll try to get around to looking at your data some time this week, and get back to you after that.

1 Like

Thanks for the tips, @colinvwood!

I think I'm going to also go back through some older datasets to see whether they had the same issues, or if this current dataset is especially strange.

Hi @colinvwood,

Just a quick update. For now, I have simply thrown out all reads that still contained one or more copies of either of the V3–V4 primers, or one or more copies of either of the two adapter trimming sequences. I did this by using several qiime cutadapt trim-paired commands to trim away (one-by-one) all the primer and adapter trimming sequences I could find in my reads (in all orientations and numbers). Then I ran a final qiime cutadapt trim-paired command with the parameter --p-minimum-length 212 to remove all reads that had been trimmed (i.e. all reads that had had some primer(s) or adapter trimming sequence(s) in them). I could do this using --p-minimum-length 212 because my "raw" forward reads were virtually all 227 nt and my reverse reads were virtually all 224 nt, and because the forward V3–V4 primer is 17 nt, the reverse primer is 20 nt, and the two adapter trimming sequences are 33 nt each. Hence, untrimmed (clean) reads passed the cutoff, but trimmed reads didn't. All in all, I threw out 47677 reads (~0.357% of all my reads). Included in that number is also all reads containing Ns. I know that qiime dada2 removes these reads automatically, but I wanted to get some idea how many reads with Ns I had, so I removed them manually too, using the --p-max-n 0 parameter in qiime cutadapt trim-paired.

Maybe there is a better way to deal with this kind of data set, but this was the solution I came to. Perhaps it would be possible to extract some meaningful information from some of those 47677 reads, but I couldn't find a way. For now, moving forward with ~99.643% of my reads seems like a fair trade-off for just being able to proceed with the next step of the data processing.

Thank you again so much for all your help! I hope that this discussion will be useful to others facing similar issues in the future.

1 Like