Questions about 16S data from Novogene (UK)

colinvwood · July 24, 2023, 4:53pm

Not clear to me what's going on here (maybe a few different things). Either way, I plan to remove these upstream fragments by trimming the V3–V4 F primer (and everything upstream) from the forward reads, and the V3–V4 R primer (and everything upstream) from the reverse reads, using qiime cutadapt trim-paired with --p-front-f CCTAYGGGRBGCASCAG and --p-front-r GGACTACNNGGGTATCTAAT.

I'm also not sure what's going on there. Strange that some of the upstream sequences are 16S and some are non-biological. Did you also check those upstream sequences against your adapter sequences?

To ensure there's enough overlap for merging, do you mean?

That and, if you do end up using a truncation value because it gives you better results, to be aware that all reads shorter than your truncation length are discarded. To know how many to expect to be discarded you have to look at the length distribution.

KQUB · July 26, 2023, 10:09am

Hi @colinvwood,

Okay, so yesterday I used grep to do a pretty comprehensive search for the V3–V4 primers and the two adapter trimming sequences in my forward and reverse reads. I searched for each primer and adapter trimming sequence separately, and in every paired combination of "primer plus adapter trimming sequence".

V3–V4 primers:

F primer: CCTAYGGGRBGCASCAG
R primer: GGACTACNNGGGTATCTAAT

Adapter trimming sequences:

Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Here's a summary of the results:

Separate searches:

F primer in F reads:
— 18376 total hits across 17414 reads
— 776 reads containing two or more copies of the F primer
— 106 reads containing three or more copies of the F primer
— 36 reads containing four or more copies of the F primer
— 20 reads containing five or more copies of the F primer
— 15 reads containing six or more copies of the F primer
— 4 reads containing seven or more copies of the F primer
— 3 reads containing eight or more copies of the F primer
— 2 reads containing nine copies of the F primer
R primer in F reads:
— 7 total hits across 7 reads (those 7 reads each contain one copy of the R primer)
F primer in R reads:
— 11 total hits across 10 reads (1 read had two copies of the F primer)
R primer in R reads:
— 11449 total hits across 9969 reads
— 834 reads containing two or more copies of the R primer
— 360 reads containing three or more copies of the R primer
— 169 reads containing four or more copies of the R primer
— 77 reads containing five or more copies of the R primer
— 27 reads containing six or more copies of the R primer
— 9 reads containing seven or more copies of the R primer
— 4 reads containing eight copies of the R primer
Read 1 adapter trimming sequence in F reads:
— 136 total hits across 136 reads (those 136 reads each contain one copy of the Read 1 adapter trimming sequence)
Read 2 adapter trimming sequence in F reads:
— 109 total hits across 109 reads (those 109 reads each contain one copy of the Read 2 adapter trimming sequence)
Read 1 adapter trimming sequence in R reads:
— 33 total hits across 33 reads (those 33 reads each contain one copy of the Read 1 adapter trimming sequence)
Read 2 adapter trimming sequence in R reads:
— 31 total hits across 31 reads (those 31 reads each contain one copy of the Read 2 adapter trimming sequence)
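
For anyone wanting to reproduce a breakdown like the one above, here is a minimal shell sketch. It assumes standard 4-line FASTQ records and a regular expression for the sequence of interest; the function name copy_distribution and the file name in the usage note are hypothetical. It counts non-overlapping exact matches per read, then reports the total number of hits and the number of reads with k or more copies. As a sanity check, the total should equal the sum of the "k or more" rows, which holds for the figures above (for the F primer in F reads, 17414 + 776 + 106 + 36 + 20 + 15 + 4 + 3 + 2 = 18376).

```bash
copy_distribution() {
  # usage: copy_distribution REGEX < reads.fastq
  awk -v re="$1" '
    NR % 4 == 2 {                 # sequence line of each 4-line FASTQ record
      n = gsub(re, "&")           # number of non-overlapping matches in this read
      if (n > 0) { hist[n]++; hits += n; if (n > max) max = n }
    }
    END {
      atleast = 0
      # accumulate from the highest copy number down to get "k or more" counts
      for (k = max; k >= 1; k--) { atleast += hist[k]; cum[k] = atleast }
      print "total_hits", hits + 0
      for (k = 1; k <= max; k++) print ">=" k, cum[k]
    }
  '
}
```

It could be called as, for example, copy_distribution 'CCTA[CT]GGG[AG][CGT]GCA[CG]CAG' < forward_reads.fastq.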

Combined searches:

F primer and Read 1 adapter trimming sequence in F reads:
— 16 reads contain at least one copy of both
— Read 1 adapter trimming sequence always present in only one copy, whereas multiple copies of F primer sometimes present (max: 4 in one read)
— F primer sequence(s) always upstream of the Read 1 adapter trimming sequence
F primer and Read 2 adapter trimming sequence in F reads:
— 12 reads contain at least one copy of both
— Read 2 adapter trimming sequence always present in only one copy, whereas multiple copies of F primer sometimes present (max: 3 in one read)
— F primer sequence(s) always upstream of the Read 2 adapter trimming sequence
R primer and Read 1 adapter trimming sequence in F reads:
— no hits
R primer and Read 2 adapter trimming sequence in F reads:
— no hits
F primer and Read 1 adapter trimming sequence in R reads:
— no hits
F primer and Read 2 adapter trimming sequence in R reads:
— no hits
R primer and Read 1 adapter trimming sequence in R reads:
— 1 read contains one copy of both
— R primer upstream of the Read 1 adapter trimming sequence
R primer and Read 2 adapter trimming sequence in R reads:
— no hits

Now, to answer your question...

Answer: In my reads, no adapter trimming sequences are ever found upstream of either one of the V3–V4 primers.

So, how to proceed...

If both the F reads and R reads contain hits to the F primer, R primer, Read 1 adapter trimming sequence, and Read 2 adapter trimming sequence, I guess I need to remove all of these sequences from both F reads and R reads.

Given that the V3–V4 primers are sometimes found in multiple copies (max: 9), I should probably use --p-times 10 in my qiime cutadapt trim-paired commands, to ensure that all copies get removed.

Adapter trimming sequences only ever appear as a single copy, so I guess I don't need to use --p-times on them.

For now, I'm thinking of this approach:

Removing the Read 1 adapter trimming sequence from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  --p-adapter-r AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  --o-trimmed-sequences demux2.qza \
  --verbose > cutadapt_output1.txt

qiime demux summarize \
  --i-data demux2.qza \
  --o-visualization demux2.qzv

Removing the Read 2 adapter trimming sequence from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux2.qza \
  --p-adapter-f AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-adapter-r AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --o-trimmed-sequences demux3.qza \
  --verbose > cutadapt_output2.txt

qiime demux summarize \
  --i-data demux3.qza \
  --o-visualization demux3.qzv

Removing the F primer from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux3.qza \
  --p-front-f CCTAYGGGRBGCASCAG \
  --p-front-r CCTAYGGGRBGCASCAG \
  --p-times 10 \
  --o-trimmed-sequences demux4.qza \
  --verbose > cutadapt_output3.txt

qiime demux summarize \
  --i-data demux4.qza \
  --o-visualization demux4.qzv

Removing the R primer from F and R reads:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux4.qza \
  --p-front-f GGACTACNNGGGTATCTAAT \
  --p-front-r GGACTACNNGGGTATCTAAT \
  --p-times 10 \
  --o-trimmed-sequences demux5.qza \
  --verbose > cutadapt_output4.txt

qiime demux summarize \
  --i-data demux5.qza \
  --o-visualization demux5.qzv

Does this seem reasonable to you, @colinvwood?

Also, two other questions:

Have you ever encountered a situation like this before, where some of the F reads and R reads contain hits to the F primer, R primer, Read 1 adapter trimming sequence, and Read 2 adapter trimming sequence? I'm surprised that all of these sequences are found in both F and R reads.
Do you know why some of my reads have multiple internal copies of the F or R primer in them? I mean, obviously it's an artefact of some sort. Is it common to find multiple internal copies of amplicon primers in 16S reads?

Thanks, as always, for the help!

EDIT:

I just also searched for the reverse complements of the V3–V4 primers, and the reverse complements of the two adapter trimming sequences:

Reverse-complemented F primer: CTGSTGCVYCCCRTAGG
Reverse-complemented R primer: ATTAGATACCCNNGTAGTCC
Reverse-complemented Read 1 adapter trimming sequence: TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
Reverse-complemented Read 2 adapter trimming sequence: ACACTCTTTCCCTACACGACGCTCTTCCGATCT
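
In case it helps anyone double-check sequences like these, here is a small shell sketch for reverse-complementing a sequence that may contain IUPAC degenerate bases. The function name revcomp is hypothetical, and the sketch assumes uppercase input.

```bash
revcomp() {
  # Reverse the sequence with awk, then complement each base with tr.
  # The tr mapping covers IUPAC codes (e.g. R<->Y, K<->M, B<->V, D<->H; S, W, N unchanged).
  printf '%s\n' "$1" |
    awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
    tr 'ACGTRYSWKMBDHVN' 'TGCAYRSWMKVHDBN'
}

revcomp CCTAYGGGRBGCASCAG      # F primer
revcomp GGACTACNNGGGTATCTAAT   # R primer
```

Run on the two primers and the two adapter trimming sequences, this gives the four reverse complements listed above.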

No hits anywhere for either of the reverse-complemented adapter trimming sequences.

No hits for the reverse-complemented F primer in the R reads.
No hits for the reverse-complemented R primer in the F reads.

But...

Reverse-complemented F primer detected in F reads (8 hits across 7 reads; one read contains two copies).
Reverse-complemented R primer detected in R reads (7 hits across 6 reads; one read contains two copies).

I suppose these reverse-complemented primers should be removed as well? Another cutadapt step needed?

colinvwood · July 26, 2023, 6:42pm

Hello @KQUB,

Some of these statistics are strange. I would have to see the actual data to give you further advice; I'm not sure whether you're comfortable sharing a subset of it. I can't promise I would be able to get back to you quickly.

One thing to note about your numbers is that they are exact matches only, because you used grep (by the way, did you use a regular expression to accommodate the degenerate bases?). Accounting for mismatches will raise all these values.

KQUB · July 27, 2023, 5:26am

Hi @colinvwood,

Thanks so much for your continued help!

I've sent you a subset of the data via private message.

I did use a regular expression to accommodate the degenerate bases in the V3–V4 primers (the adapter trimming sequences don't have degenerate bases).

Example (searching for the F primer: CCTAYGGGRBGCASCAG):

grep -E "CCTA[CT]GGG[GA][GTC]GCA[GC]CAG"
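
The same kind of pattern can also be generated rather than typed by hand. This is only a sketch: the function name iupac_to_regex and the file name reads.fastq are hypothetical, and it assumes uppercase primers and 4-line FASTQ records. The character classes come out in a different order than in the hand-written pattern above, but they match the same bases.

```bash
iupac_to_regex() {
  # Expand each IUPAC degenerate code into a bracket expression.
  # None of the replacements contain a degenerate code, so applying them in sequence is safe.
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

f_regex=$(iupac_to_regex CCTAYGGGRBGCASCAG)
echo "$f_regex"

# Reads containing the primer at least once (sequence lines only):
#   awk 'NR % 4 == 2' reads.fastq | grep -c -E "$f_regex"
# Total number of hits, counting multiple copies per read:
#   awk 'NR % 4 == 2' reads.fastq | grep -o -E "$f_regex" | wc -l
```

Restricting the search to every fourth line starting from line 2 keeps grep from matching header or quality lines.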

I agree that accounting for mismatches will generate more hits, but I'm not sure how to do that.

Do you have a go-to way of screening your own 16S reads for primers and adapters? If so, I'd love to know what your process is.

Thanks again!

colinvwood · July 31, 2023, 5:24pm

Hello @KQUB,

I haven't ever really screened for primers or adapters; rather, I've just removed them and then compared the before and after. A Google search tells me there's agrep (approximate grep) for this. Or, if you're comfortable with a programming language, it wouldn't be too hard to come up with something. I believe FastQC can also do some of this.
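
As a rough illustration of the "come up with something" option, here is a minimal sketch that counts reads containing a primer while allowing a few mismatches, using a sliding-window comparison in awk. It assumes 4-line FASTQ records and uppercase sequences, and it only tolerates substitutions (no insertions or deletions), so it is not equivalent to what agrep or cutadapt would report. The function name count_approx_hits is hypothetical.

```bash
count_approx_hits() {
  # usage: count_approx_hits PRIMER MAX_MISMATCHES < reads.fastq
  # Prints the number of reads with at least one window within MAX_MISMATCHES of PRIMER.
  awk -v primer="$1" -v maxmm="$2" '
    BEGIN {
      # IUPAC code -> bases it stands for
      split("A:A C:C G:G T:T R:AG Y:CT S:CG W:AT K:GT M:AC B:CGT D:AGT H:ACT V:ACG N:ACGT", pairs, " ")
      for (i in pairs) { split(pairs[i], kv, ":"); allowed[kv[1]] = kv[2] }
      plen = length(primer)
    }
    NR % 4 == 2 {
      found = 0
      for (start = 1; start + plen - 1 <= length($0) && !found; start++) {
        mm = 0
        # stop comparing this window as soon as the mismatch budget is exceeded
        for (j = 1; j <= plen && mm <= maxmm; j++) {
          if (index(allowed[substr(primer, j, 1)], substr($0, start + j - 1, 1)) == 0) mm++
        }
        if (mm <= maxmm) found = 1
      }
      reads_with_hit += found
    }
    END { print reads_with_hit + 0 }
  '
}
```

For example, count_approx_hits CCTAYGGGRBGCASCAG 1 < forward_reads.fastq (file name hypothetical) would count forward reads with the F primer at up to one mismatch; with 0 mismatches it should agree with the exact grep counts.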

I'll try to get around to looking at your data some time this week, and get back to you after that.

KQUB · August 1, 2023, 8:13am

Thanks for the tips, @colinvwood!

I think I'm going to also go back through some older datasets to see whether they had the same issues, or if this current dataset is especially strange.

KQUB · August 9, 2023, 6:00am

Hi @colinvwood,

Just a quick update. For now, I have simply thrown out all reads that contained one or more copies of either V3–V4 primer or of either adapter trimming sequence. I did this by running several qiime cutadapt trim-paired commands to trim away, one by one, all the primer and adapter trimming sequences I could find in my reads (in all orientations and copy numbers). Then I ran a final qiime cutadapt trim-paired command with the parameter --p-minimum-length 212 to remove all reads that had been trimmed (i.e. all reads that had contained some primer or adapter trimming sequence).

The --p-minimum-length 212 cutoff works because my "raw" forward reads were virtually all 227 nt and my reverse reads were virtually all 224 nt, while the forward V3–V4 primer is 17 nt, the reverse primer is 20 nt, and the two adapter trimming sequences are 33 nt each. Hence, untrimmed (clean) reads passed the cutoff, but trimmed reads didn't.

All in all, I threw out 47677 reads (~0.357% of all my reads). That number also includes all reads containing Ns. I know that qiime dada2 removes these reads automatically, but I wanted to get some idea of how many reads with Ns I had, so I removed them manually too, using the --p-max-n 0 parameter in qiime cutadapt trim-paired.
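
The arithmetic behind the 212 cutoff can be written out as a small shell sketch. It assumes that every trimming event removes at least one full-length copy of the shortest sequence (the 17-nt F primer); partial matches shorter than that would not be caught by this reasoning. The variable names are just for illustration.

```bash
fwd_raw=227      # raw forward read length (nt)
rev_raw=224      # raw reverse read length (nt)
f_primer=17      # V3-V4 F primer length
r_primer=20      # V3-V4 R primer length
adapter=33       # length of each adapter trimming sequence
min_len=212      # value passed to --p-minimum-length

# Shortest sequence that trimming can remove in full
shortest_removed=$adapter
for len in "$f_primer" "$r_primer"; do
  if [ "$len" -lt "$shortest_removed" ]; then shortest_removed=$len; fi
done

longest_trimmed=$((fwd_raw - shortest_removed))   # longest a trimmed read can be: 227 - 17 = 210
shortest_untrimmed=$rev_raw                        # shortest untrimmed read: 224

echo "longest trimmed read:    $longest_trimmed nt (below the cutoff of $min_len)"
echo "shortest untrimmed read: $shortest_untrimmed nt (at or above the cutoff of $min_len)"
```

Any cutoff from 211 to 224 would separate the two groups under these assumptions.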

Maybe there is a better way to deal with this kind of dataset, but this was the solution I came to. Perhaps it would be possible to extract some meaningful information from some of those 47677 reads, but I couldn't find a way. For now, moving forward with ~99.643% of my reads seems like a fair trade-off for being able to proceed with the next step of the data processing.

Thank you again so much for all your help! I hope that this discussion will be useful to others facing similar issues in the future.