Questions about 16S data from Novogene (UK)

Hi there,

We recently sent some DNA samples (extracted from groundwater and surface water) to Novogene (UK) for amplicon sequencing of the V3–V4 subregions of the 16S rRNA gene.

For 16S amplicons, Novogene uses 2x250 bp paired-end sequencing on the Illumina NovaSeq 6000 system.

I have some questions about the final sequencing data, which was provided to us in 'RawData' and 'CleanData' formats. (I’ve sent Novogene an email with these questions, and have already posted on the Biostars forum as well, but no solutions yet, so thought I'd post here too.)

My understanding is that the 'RawData' folder contains the raw forward and reverse reads for each sample, and the 'CleanData' folder contains cleaned and merged reads for each sample, but the info from Novogene isn't very clear about this.

image

Here's what's in the 'RawData' folder for one of our samples (K2):

image

Question 1: If they used 2x250 bp paired-end sequencing, why are our raw reads not 250 nt long?

Prior to this, we have done all our 16S rRNA amplicon sequencing with Macrogen (Korea). They used 2x300 bp paired-end sequencing on the Illumina MiSeq platform, and we always got raw reads back that were all 301 nt long.

For 16S amplicons, Novogene uses 2x250 bp paired-end sequencing on the Illumina NovaSeq 6000 system. However, the length of the 'RawData' reads they sent us isn’t 250 nt. Most of the forward reads (e.g. in K2_1.fastq.gz) are 227 nt, and most of the reverse reads (e.g. in K2_2.fastq.gz) are 224 nt, and there are also some shorter reads mixed in (with lengths like 24 nt, 31 nt, 53 nt, 121 nt...).

I'm confused because I expected 250 nt reads, but that's not what I'm seeing, and I can’t find any information from Novogene about exactly what processing has been done on the 'RawData', if any. (I've asked them, but no reply yet.)

My guess, based on the lengths of the reads we received, is that they have already trimmed the V3–V4 primers away from the reads, as well as an extra 6 nt (see calculations below). I should also mention that I've taken a glance at the beginnings of the forward and reverse reads, and didn't find any V3–V4 primers there.

These are the primers Novogene uses for sequencing the V3–V4 subregions:

F primer: CCTAYGGGRBGCASCAG (17 nt)
R primer: GGACTACNNGGGTATCTAAT (20 nt)

So...

Assuming they trimmed away the primer and an extra 6 nt in each case:

F reads: 250 – 17 – 6 = 227 nt
R reads: 250 – 20 – 6 = 224 nt

I guess the extra 6 nt could have been taken from the 5’ or 3’ end of the reads, or some combination of both. We don't know.

When I tried running DADA2 via :qiime2: (with no extra trimming or truncation) most of the forward and reverse reads did successfully merge, so that’s possibly further confirmation that the initial trimming was likely done (at least mostly) at the beginning of the reads, since there is apparently still enough overlap for merging.

Do other people agree with my reasoning here (i.e. that the "raw" reads aren't actually raw, and have actually already been trimmed to at least remove the primers)?

Or could there be some other explanation?

I wish Novogene gave a more precise description of what they've done on their end.

Question 2: Why are there V3–V4 primer sequences in the middle of some of my reads, and what should I do about that?

Even though the V3–V4 primers appear to have been trimmed away from the beginnings of the 'RawData' reads already, I still found them within some of the reads, just not right at the beginning, which is weird to me (see below). What does this mean?

For example, the F primer sequence CCTAYGGGRBGCASCAG is found in these forward reads:

@A01426:481:HFYHFDRX2:1:2103:18222:23938 1:N:0:TACGACGT+CCATGAAC GGCGACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATCTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCGTTAGGGGGGTAAAGCTCTTTTGGCCGGGAGGATGATGGCAGTGCCGGGCGGGTCTGGTCGGGGGGCGGCGGGGGCGGGGGGG

@A01426:481:HFYHFDRX2:1:2115:12843:7294 1:N:0:TACGACGT+CCATGAAC ATACCCCCGTAGTCCCGAAACAGATACCCATGTAGTCCGCTCATAGAGAGGGATGCTCTTCCGATGTCCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGAGGAAACTCTGATGCAGCGATGCCGCGTGAGTGAAGACGGCCTTTGGGTTGTAAAGCTCTTTTGTAGGGGAAGATAATGACTGTAACCTAAGAATAAGGTCCGGCTAACTTCGTGCCCGCAGC

And the R primer sequence GGACTACNNGGGTATCTAAT is found in these reverse reads:

@A01426:481:HFYHFDRX2:1:2128:10022:6543 2:N:0:TACGACGT+CCATGAAC GTTTCGGGACTACACGGGTATCTAATCCTGTTTGATCCCCACGCTTTCGTGTCTCAGCGTCAGTTACAATTTAGCAAGCTGCCTTCGCAATCGGTGTTCTGTGTGATCTCTAAGCATTTCACCGCTACACCACACATTCCGCCTACTTCAATTGTACTCAAGAATATCAGTTTCTATGGCAGTTCTACAGTTAAGCTGTAGGCTTTCACCACTGACTTAATACC

@A01426:481:HFYHFDRX2:1:2134:2871:31203 2:N:0:TACGACGT+CCATGAAC TCTGCTGCATCCAGGAGGCTGATCGAGTTGTTTAGGGACTACGAGGGTATCTAATACTGTTTGCTCCCCACGCTTTCGTGCATGAGCGTCATTGTTATCCCAGGGGGCTGCCTTCGCCAATGGTATTCCTCCACAGATCTACGCATTTCACTGCTAAACGTGGAATTCTACAACCCTATGACACACACTAGATATACAGTCACATGCGCAATACCCAGGTTAAG

Sorry for the not-very-:qiime2:-related post, but if anyone has any insights, please do let me know. I am still very much a noob and have much to learn!

Cheers,

Kevin

Hello @KQUB,

First of all, before this causes any confusion in the future, do you know whether or not your sequences have been demultiplexed? My understanding is that you have a file for each sample, but you only showed the files for one sample in the "RawData" folder in your screenshot?

Question 1: If they used 2x250 bp paired-end sequencing, why are our raw reads not 250 nt long?

Most likely they have removed the adapters and/or the primers for you, as you suspect.

Question 2: Why are there primer sequences in the middle of some of my reads, and what should I do about that?

Probably because they only searched for primers at the beginning and end of reads. You can use our qiime cutadapt trim-paired command to remove primers that occur anywhere in your reads.

Do other people agree with my reasoning here (i.e. that the "raw" reads aren't actually raw, and have actually already been trimmed to at least remove the primers)?

Yes absolutely.

Since it is unclear what quality control exactly has been done to your data, and the sequencing center has not communicated crucial information (which is not uncommon), the safest thing to do is cover all bases yourself:

  • Figure out which adapters were used in your library prep and remove them if present.
  • Remove all remaining occurrences of you primers in your dataset.
  • Demultiplex if it has not been done already

Let me know if this answers all your questions!

1 Like

Hi @colinvwood. Thanks for the reply! :blush:

That's right. Within the 'RawData' folder are folders for each sample: K1, K2, K3, K4 etc., and within each of those folders there are three fastq.gz files (as shown for sample K2 in my original post):

  • one with forward reads (e.g. K2_1.fastq.gz)
  • one with reverse reads (e.g. K2_2.fastq.gz)
  • one 'extendedFrags' file (e.g. K2.extendedFrags.fastq.gz) [I don't know what this file is.]

So, the sequences have already been assigned to the samples they came from (i.e. demultiplexed).

Seems so. Kind of annoying that Novogene didn't explain exactly what trimming or truncation they have done.

Thanks for this! I'll have to read up on it. Again, I wish Novogene had provided more details about what they have done.

This is the info I have from Novogene about the library prep:

Sequencing libraries were generated using TruSeq® DNA PCR-Free Sample Preparation Kit (Illumina, USA) following manufacturer's recommendations and index codes were added.

I went looking for the TruSeq indexes and found this info:

TruSeq™ single index (previously LT) and TruSeq CD index (previously HT)-based kits:

  • Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
  • Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

I searched the forward reads of one of my samples and found hits for both of these indexes:

The following forward read contains the Read 1 index:

@A01426:481:HFYHFDRX2:1:2141:5403:30107 1:N:0:TACGACGT+CCATGAAC

CCAGTTGTAACCGAGCACCGCGGGCACGGCCACGGCCAAACCAATGGCCGTCATGATCAAGGACTCGCCCACGGGGCCGGCCACTTTGTCAATGGACGCTTGACCGGACATGCCGATTTTGACCAGCGCGTTGTAGATACCCGTGTAGTCCCGAAACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTACGACGTATCGCGTATGCCGTCTTCTGCTTGAAAA

The following two forward reads contain the Read 2 index:

@A01426:481:HFYHFDRX2:1:2105:3251:8093 1:N:0:TACGACGT+CCATGAAC

GTCGATAATGATGAACAATCCGGCTATAGTGAGGGCACCAATTGTCAGCAGTTCGGCAAGGCGAGCGACGGGATAGAAGACCTTCAGGTTGAAGAGCCTGACCCCGGCCGAGATGAGAATACCGCCATGGGAGATGCCGACCCACCAGATGAAGGAACCGATGTAGATACCCCTGTAGTCACGAAACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTCCATGAA

@A01426:481:HFYHFDRX2:1:2201:16613:11256 1:N:0:TACGACGT+CCATGAAC

TCGTCATTGAACAGTGCCCGGGTCACCGACTCCGGCACAGAAGTCAAGCCTTCCCCCACGGCAATCTGGGCCATATCTCGCCACGGTGCTGCGGGGTTGAAGGCGTCATCAGTAGATTAGTCCGGCAGTTCCATCCAGTCTCCAGGCTGTTGTTGAGGAGCCCTATAGATACCCGGGTCGTCCCGAAACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTCCAT

I also searched for the indexes in the reverse reads from the same sample, but no hits for either index.

The site where I found the index sequences also says:

When performing sequencing on an Illumina instrument, sequences corresponding to the library adapters can be present in the FASTQ files at the 3’ end of the reads if the read length is longer than the insert size.

So, am I right in thinking that since the amplicon (i.e. insert) size here is ~470 bp and the reads are only 250 nt (or less), any reads that contain these TruSeq index sequences towards their 3' end are actually reads of an unwanted insert of some sort, because any read of the target insert shouldn't be long enough to read into the TruSeq index sequence beyond the end of the insert?

I can imagine that when sequencing forward reads, sequencing beyond the insert could lead to the Read 2 index being near the 3' end of the forward read; and likewise for the Read 1 index when sequencing reverse reads, right? But how could the Read 1 index be in a forward read (as shown in one read above)? That's confusing to me.

Also, am I wrong to expect to find a V3–V4 primer next to each of the TruSeq index sequences in the reads I pasted above? The V3–V4 primers (shown below) don't seem to be in those reads.

F primer: CCTAYGGGRBGCASCAG (17 nt)
R primer: GGACTACNNGGGTATCTAAT (20 nt)

Thanks in advance for any help! I'm very grateful for any insights you can give, because I find all of this quite confusing. :face_with_spiral_eyes:

EDIT: I used the term "index" incorrectly in this post. The sequences I was referring to with this term are, in fact, the "adapter trimming sequences" (found here).

Hello @KQUB,

Technically, those sequences that you found in the protocol and have searched for in your sequences are called adapters. Indexes usually refer to the barcode used to multiplex your samples and are the e.g. TACGACGT+CCATGAAC section of the header.

So, am I right in thinking that since the amplicon (i.e. insert) size here is ~470 bp and the reads are only 250 nt (or less), any reads that contain these TruSeq index sequences towards their 3' end are actually reads of an unwanted insert of some sort, because any read of the target insert shouldn't be long enough to read into the TruSeq index sequence beyond the end of the insert?

Yes that's right. "Unwanted" should probably only refer to the insert's length, not its information--those sequences should still be assumed to be true biological sequences until there's reason to believe otherwise.

I can imagine that when sequencing forward reads, sequencing beyond the insert could lead to the Read 2 index being near the 3' end of the forward read; and likewise for the Read 1 index when sequencing reverse reads, right? But how could the Read 1 index be in a forward read (as shown in one read above)? That's confusing to me.

That is indeed weird, but mis-ligation can happen I believe, and chimeric sequences can form with adapters put into the middle of the sequence. Lots of strange stuff happens in those library prep tubes :person_shrugging:. You should certainly see far more of the adapter that is attached to the flow cell during forward read sequencing in the forward reads than the other adapter.

Also, am I wrong to expect to find a primer next to each of the TruSeq index sequences in the reads I pasted above? The primers (shown below) don't seem to be in those reads.

I'm not sure, you'd have to go read the library prep protocol in detail to see how all the synthetic sequences are pasted together. And even if one would expect to see them together, you already have evidence that the sequencing center has removed some primer sequences and left others. EDIT: actually, no you wouldn't expect to see a primer here because these are inserts that were too short, and so the sequence ends somewhere in the middle of the amplicon, not near a primer.

EDIT: I noticed that your barcodes are also in the three sequences you included in your most recent post :man_facepalming:. You'll need to search for and remove these as well.

2 Likes

Thanks for replying again, @colinvwood.

Would it be even more accurate to say that the sequences I looked for in my reads are the sequencing primers, and that the term 'adapter' refers to the sequencing primer plus barcode (a.k.a. index) plus the sequence that binds to the oligo on the flow cell. This figure seems to suggest this:

See also this quote from the 'Illumina dye sequencing' wiki page:

Adapters contain three different segments: the sequence complementary to solid support (oligonucleotides on flow cell), the barcode sequence (indices), and the binding site for the sequencing primer. Indices are usually six base pairs long and are used during DNA sequence analysis to identify samples.

Also, if the indices (a.k.a. indexes, barcodes) are "usually six base pairs long" that would explain the extra missing 6 nt I mentioned in my original post:

The 17 nt and 20 nt regions are the V3–V4 primers and the 6 nt regions are the barcodes (presumably). But actually no, because the FASTQ headers in my samples give 8 nt barcodes (e.g. TACGACGT+CCATGAAC)? Well, it almost makes sense. :face_with_spiral_eyes:

Okay, but they likely aren't long enough to merge when I run DADA2?

Makes sense!

Yes, I actually noticed that myself but forgot to mention it in my last post. Some more things to remove!

Okay, so I understand I should confirm the removal of all V3–V4 primers and all adapter elements (sequencing primers, barcodes, and regions that bind the flow cell oligos) from my reads. I'm trying to get my head around what that means, in practice. Is this right?:

  • If I find one of the V3–V4 primers at (or near) the beginning of a read, I should remove it and all preceding bases? (What about if it's near the middle of the read? Same thing?)
  • If I find any of the adapter elements near the 3' end of a read, I should remove them and all subsequent bases? I guess that if the adapter is found near the 3' end of a read, the sequencing primer would be the most 5' part of the adapter, so looking for the sequencing primers and removing all subsequent bases should ensure removal of any other adapter elements downstream of the sequencing primer, right?

Thank you again for the help! :blush:

Hello @KQUB,

Would it be even more accurate to say that the sequences I looked for in my reads are the sequencing primers

If you're referring to the first search you did, no, those were your 16S amplicon primers. The primer in the "adapter" is the index primer. Your amplicon primers are part of the insert in that diagram.

and that the term 'adapter' refers to the sequencing primer plus barcode (a.k.a. index) plus the sequence that binds to the oligo on the flow cell.

This is one way of using the term "adapter," but maybe not the right way in your circumstance. Those sequences that you found in the protocol can't possibly take into account the different indices used because there were only two of them, and thus can't meet this definition of an adapter. These sequences that you found are most likely prefixes of the entire three-part adapter, but it's sufficient to search for only these because the remainder of reads after these occurrences are removed by the tools that are used for this process. (My definition of adapter as only the bit that adheres to the flow cell was thus also misleading, sorry).

The 17 nt and 20 nt regions are the primers and the 6 nt regions are the barcodes (presumably). But actually no, because the FASTQ headers in my samples give 8 nt barcodes (e.g. TACGACGT+CCATGAAC)? Well, it almost makes sense.

I wouldn't try to think about this too hard. The number of sequencing cycles (thus the read lengths) are configurable, and you'll probably never know what your original read lengths were because your "raw" reads have already undergone some QC.

Okay, but they likely aren't long enough to merge when I run DADA2?

It will depend on just how short they are. But that was the only point I was making--just let dada2 figure it out.

If I find one of the V3–V4 primers at (or near) the beginning of a read, I should remove it and all preceding bases? (What about if it's near the middle of the read? Same thing?)

If you use cutadapt within qiime this is what will happen, yes (not a choice you have/get to make).

If I find any of the adapter elements near the 3' end of a read, I should remove them and all subsequent bases? I guess that if the adapter is found near the 3' end of a read, the sequencing primer would be the most 5' part of the adapter, so looking for the sequencing primers and removing all subsequent bases should ensure removal of any other adapter elements downstream of the sequencing primer, right?

Yes, again happens by default. This might remove all occurrences of your indices as well, but it's probably still worth looking for them afterwards.

1 Like

Thanks again for the input, @colinvwood, and sorry for any miscommunication.

My current understanding is that these sequences...

TruSeq™ single index (previously LT) and TruSeq CD index (previously HT)-based kits:

Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

...are not the full adapter sequences, rather the sequences "used for adapter trimming" when processing reads generated from libraries prepared with TruSeq kits. Illumina says so here and here.

So, given Illumina's three-component definition of 'adapter', I was trying to figure out which parts of the adapter the sequences "used for adapter trimming" correspond to? My previous guess was the "sequencing primer binding sites" (but I'm not sure of that guess and perhaps I didn't express it very well in my last post anyway).

Here's a figure from the Illumina site:

And another from Novogene America:

An aside about barcodes:

I'm not sure if either the IDT for Illumina–TruSeq DNA and RNA UD Indexes or TruSeq DNA and RNA CD Indexes apply to my data. I think neither, because I searched for some of the barcodes from my samples (e.g. GTGTCCTT+CCACAACA and TACGACGT+CCATGAAC) on those pages and couldn't find them. Maybe Novogene uses its own barcoding system. I guess the important thing about the barcodes is that they are listed in the FASTQ headers and can therefore be targeted for removal.

Anyway, back to figuring out which part of the adapters those sequences "used for adapter trimming" correspond to...

On the pages for both IDT for Illumina–TruSeq DNA and RNA UD Indexes and TruSeq DNA and RNA CD Indexes Illumina gives these sequences:

The following sequences are used for adapter trimming.
Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

(reverse complement of Read 2: ACACTCTTTCCCTACACGACGCTCTTCCGATCT)

OpenIndex Adapters
Index 1 (i7) Adapters
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG
Index 2 (i5) Adapters
AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Looking at the structure of these OpenIndex Adapters, I can see the first "adapter trimming sequence" (for Read 1) at the start of the Index 1 (i7) Adapter...

Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Index 1 (i7) Adapters
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG

...and the reverse complement of the second "adapter trimming sequence" (for Read 2) at the end of the Index 2 (i5) Adapter:

Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

(reverse complement of Read 2: ACACTCTTTCCCTACACGACGCTCTTCCGATCT)

Index 2 (i5) Adapters
AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Can I use this info to identify which parts of the adapters the "adapter trimming sequences" correspond to, in terms of these figures?

image

Let me try...

If the first read goes from the P5 adapter towards the P7 adapter, and I have the following sequence near the 3' end of one of my forward reads (Read 1 "adapter trimming sequence" in bold, barcode in italics, last bit of adapter in bold again)...

@A01426:481:HFYHFDRX2:1:2141:5403:30107 1:N:0:TACGACGT+CCATGAAC
CCAGTTGTAACCGAGCACCGCGGGCACGGCCACGGCCAAACCAATGGCCGTCATGATCAAGGACTCGCCCACGGGGCCGGCCACTTTGTCAATGGACGCTTGACCGGACATGCCGATTTTGACCAGCGCGTTGTAGATACCCGTGTAGTCCCGAAACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTACGACGTATCGCGTATGCCGTCTTCTGCTTGAAAA

...then the "adapter trimming sequence" here is most likely the sequencing primer binding site (upstream of the Index 1 barcode), right?

Index 1 (i7) Adapters
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG

So, in the forward reads, it would make sense to trim away the Read 1 "adapter trimming sequence" GATCGGAAGAGCACACGTCTGAACTCCAGTCA and everything downstream of it. That much makes total sense to me.

But here's another forward read:

@A01426:481:HFYHFDRX2:1:2113:19822:7733 1:N:0:TACGACGT+CCATGAAC
AGAAACCTGAGATACCAGAAATACCGCTGTAACCAGAGATACCAGATGCGCCAGAGATTCCAGAGAAACCACTGTAGCCTGAGATGCCAGATGCGCCAGAGAAGCCTGAGATACCAGATGCACCGCTGTAACCAGAGATACCCGTGTAGTCCCGAAACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTCCATGAACGTGTAGATCTCGGTGGTCGGCGGCTGCT

This one's got the Read 2 "adapter trimming sequence" (in bold) and the other barcode (in italics).

And other forward read with the same thing:

@A01426:481:HFYHFDRX2:1:2203:28158:34851 1:N:0:TACGACGT+CCATGAAC
CCACCACCACTGTCGCACCAGCAACTAAGCCAAGCCCAACAGCTACCGCTACTGCTAACGCGCTCTCCAAACTCGATACTTTGACCCCAACCGTAGATACCCAAGTAGTCCCGAAACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTCCATGAACGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAGTTGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I searched the reverse reads of several of my samples for all of the potential adapter sequences listed above, but didn't find anything.

Since the first read goes from P5 to P7, I would have expected components of the P7 adapter to perhaps be present in some forward reads, and components of the P5 adapter to perhaps be present in some reverse reads, but that's not what I'm seeing.

I'm seeing that components of both adapters are found in forward reads, but neither are found in reverse reads? Is there some reason for this? Trying to wrap my head around it.

Thanks for your help so far! :blush:

Hello @KQUB,

My current understanding is that these sequences... ...are not the full adapter sequences, rather the sequences "used for adapter trimming" when processing reads generated from libraries prepared with TruSeq kits.

I agree. If we adopt the definition of adapters as three-component sequences then they couldn't possibly be the entire sequence because the index necessarily varies. The reason they are used as "adapter trimming sequences" is because once you encounter these sequences you are going to remove all downstream nucleotides anyway because you know they are non-biological.

So, given Illumina's three-component definition of 'adapter', I was trying to figure out which parts of the adapter the sequences "used for adapter trimming" correspond to? My previous guess was the "sequencing primer binding sites" (but I'm not sure of that guess and perhaps I didn't express it very well in my last post anyway).

This is right. Looking at the top library fragment in the top diagram, and remembering that polymerase moves in the 5' -> 3' direction of the new strand, what happens is:

  • A primer complimentary to "Rd2SP" binds
  • Extension begins here into the insert (from right to left)
  • if the insert is short enough, "Rd1SP" begins to get sequenced, then possibly "Index 2", then possibly "P5"

I'm not sure if either the IDT for Illumina–TruSeq DNA and RNA UD Indexes apply to my data. I think neither, because I searched for some of the barcodes from my samples (e.g. GTGTCCTT+CCACAACA and TACGACGT+CCATGAAC) on those pages and couldn't find them. Maybe Novogene uses its own barcoding system. I guess the important thing about the barcodes is that they are listed in the FASTQ headers and can therefore be targeted for removal.

Exactly, your barcodes are still in the fastq headers (sometimes after demultiplexing they are removed) so you don't need to worry about finding those. Although these are something the sequencing center should ideally provide to you.

Since the first read goes from P5 to P7, I would have expected components of the P7 adapter to perhaps be present in some forward reads, and components of the P5 adapter to perhaps be present in some reverse reads, but that's not what I'm seeing.

Agreed.

I'm seeing that components of both adapters are found in forward reads, but neither are found in reverse reads? Is there some reason for this? Trying to wrap my head around it.

Did you also search for the reverse complements of the forward read sequencing primers in the reverse reads? I find it hard to believe that there are no occurrences of adapters in your reverse reads.

Taken together this is indeed strange, I don't have an explanation. Some thoughts:

  • It wouldn't be impossible to see some forward read sequencing primers towards the 3' end of forward reads due to mis-ligation or chimera formation. Do these two types of forward reads (those with the expected read 2 adapters and those with the unexpected read 1 adapters) occur at starkly different frequencies?

  • Reverse reads have lower quality than forward reads, especially towards the 3' ends (sequenced last). If your sequencing center performed quality trimming (which wouldn't be outside the realm of possibility given what we know they've done to "raw" reads), then the adapter sequences may have been removed from reverse reads in this way. You would have to look at the read length distribution of all your reads to find evidence for this.

1 Like

Hi @colinvwood,

Before now I had only glanced at a few samples when talking about adapters in the reads, and my observations were based only on those few samples.

Now I went looking for the different parts of the OpenIndex Adapters in my full dataset:

OpenIndex Adapters
Index 1 (i7) Adapters
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG
Index 2 (i5) Adapters
AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATCT

I searched for the following sequences:

  1. OpenIndex 1 (i7) Adapter first part
    GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

  2. OpenIndex 1 (i7) Adapter second part
    ATCTCGTATGCCGTCTTCTGCTTG

  3. OpenIndex 2 (i5) Adapter first part (reverse complement)
    GTGTAGATCTCGGTGGTCGCCGTATCATT

  4. OpenIndex 2 (i5) Adapter second part (reverse complement)
    AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Counts of these sequences in forward reads:

  1. 135
  2. 41
  3. 15
  4. 109

Counts of these sequences in reverse reads:

  1. 33
  2. 9
  3. 11
  4. 31

I actually looked for many variations of the above adapter sequences (i.e. the original sequence, reverse, complement, reverse complement). The ones shown above were the only ones that got hits.

So, clearly bits of both adapters are in the forward reads and in the reverse reads. I guess I should try running cutadapt on forward and reverse reads with both of the "adapter trimming sequences" ...

Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

... each time, then re-do the count and see if they've been successfully removed.

If I do this with cutadapt, all parts of both adapters will be removed from all reads simultaneously, right? So, I don't have to then look for and remove barcodes separately?

Good that I checked both forward and reverse reads for both adapters, because otherwise I might have just trimmed the P7 adapter from forward reads and the P5 adapters from reverse reads. Now I know I should trim both adapters from both kinds of reads, and I guess DADA2 will also remove the chimeras, so hopefully after that I'll have some nice data to work with.

Thanks again! I've learnt quite a bunch of stuff from this conversation. :blush:

Hello @KQUB,

So, clearly bits of both adapters are in the forward reads and in the reverse reads. I guess I should try running cutadapt on forward and reverse reads with both of the "adapter trimming sequences" ... ... each time, then re-do the count and see if they've been successfully removed.

Cutadapt will trim both read directions at the same time. The trim-paired action will allow up to four sequences to be searched for: for both read directions, a sequence at the 5' end and a sequence at the 3' end. You should confirm this, but I think that cutadpt will look for the reverse complement of these sequences as well. You can pass multiple sequences to each of these parameters at once, but remember to use the --n-times parameter and to read the cutadapt docs because it gets confusing.

If I do this with cutadapt , all parts of both adapters will be removed from all reads simultaneously, right? So, I don't have to then look for and remove barcodes separately?

Since the barcodes are downstream of the adapter trimming sequences, all of the barcodes should be removed as well afterwards.

Dada2 will remove the chimeras, yes. Best of luck with the rest of your analysis!

1 Like

Hi again, @colinvwood,

Some more questions (hopefully among the last)!

I think I have two clear (logical) objectives for the cutadapt step:

  1. Remove all reads with Ns (i.e. unknown nucleotides). I've seen that some of my reads have Ns, and I know that DADA2 can't handle Ns, so those reads should be removed.
  2. Remove the adapters we might logically expect to see in some of the reads as a result of sequencing through to the other side of a short DNA insert (i.e. removing the P7 adapter from forward reads, and the P5 adapter from reverse reads; or, according to this figure, removing everything downstream of the DNA insert from the forward reads, and removing everything upstream of the DNA insert from the reverse reads):
    image

Given that the V3–V4 primers appear to have already been removed from the ends of my "raw" sequences, one might imagine that fulfilling the two above objectives would be enough for me to then move on to denoising with DADA2; because having done those, I would have reads that are free from the V3–V4 primers, free from the adapters, and free from Ns. But although the bit about removing Ns will be true, the rest won't, actually.

I have seen that some of my forward reads (illogically, I think) contain bits of the P5 adapter, and some of my reverse reads contain bits of the P7 adapter (see my previous post in this conversation). In addition, some of my reads contain the V3–V4 primers in the middle of them (see my first post in this conversation). Those two kinds of reads don't make sense to me, based on my current understanding of Illumina sequencing. Would you agree that those kinds of reads don't follow expected behaviour? If so, are they most likely the result of artefacts arising from the library preparation or sequencing processes? And lastly, is there any way I can just dump these reads, or salvage some useful info from them, or how should I deal with them?

Thanks, as always, for your time! :blush:

Hello @KQUB,

  1. Remove all reads with Ns (i.e. unknown nucleotides). I've seen that some of my reads have Ns, and I know that DADA2 can't handle Ns, so those reads should be removed.

Dada2 takes care of this for you, no need to do this yourself.

I have seen that some of my forward reads (illogically, I think) contain bits of the P5 adapter, and some of my reverse reads contain bits of the P7 adapter (see my previous post in this conversation). In addition, some of my reads contain the V3–V4 primers in the middle of them (see my first post in this conversation).

The simplest thing to do is to search for both adapter prefix sequences and both primers in both of your read directions.

Those two kinds of reads don't make sense to me, based on my current understanding of Illumina sequencing. Would you agree that those kinds of reads don't follow expected behaviour?

Regarding the opposite adapter situation, yes. I wouldn't necessarily have been surprised by some exceptions, but I wouldn't have expected them in the frequencies that you shared.

Regarding amplicon primers in the middle of sequences, that is not that uncommon (primer dimers, chimeras, non-specific amplification, short amplicons, ...).

If so, are they most likely the result of artifacts arising from the library preparation or sequencing processes?

Again, regarding the adapter issue, I think so, unless there is some lack of understanding on my part.

And lastly, is there any way I can just dump these reads, or salvage some useful info from them, or how should I deal with them?

Letting the trimming tools do their thing will amount to salvaging as much as is possible. One could make an argument for discarding these types of reads completely however, because as you say, there is reason to believe they are artifacts of library prep, PCR, or sequencing. That is a subjective decision I suppose.

I wouldn't discard reads that contain amplicon primers in the middle of the sequences because there are 'natural' ways this can occur.

1 Like

Hello @colinvwood,

Back again with some more questions.

Okay, is this something specific to q2-dada2? Because the tutorial on the page for the DADA2 R package says "DADA2 requires no Ns" (see the 'Filter and trim' section here).

This sounds like a good approach!

Are there? How so? (Sorry if you've already mentioned this somewhere above.)

Lastly, can you recommend some reading material for me to help me get my head around all of this stuff? Where did you get all your knowledge about Illumina sequencing, sequencing artefacts etc.? For example, I'd love to know more about how things like primer dimers, chimeras, and non-specific amplification occur and how common they are. Do you know of any resources for that?

Thanks as always for your help! :blush:

Hello @KQUB,

Okay, is this something specific to q2-dada2 ? Because the tutorial on the page for the DADA2 R package says "DADA2 requires no Ns" (see the 'Filter and trim' section here ).

We wrap the same filterAndTrim function that's in that tutorial so it's done for you.

I wouldn't discard reads that contain amplicon primers in the middle of the sequences because there are 'natural' ways this can occur.

Are there? How so? (Sorry if you've already mentioned this somewhere above.)

All I was getting at here is that there is no reason to assume that the sequence upstream of a primer isn't true biological sequence. People generally conserve as much of a read as possible, and don't discard an entire read because it contains a non-biological subsequence.

Lastly, can you recommend some reading material for me to help me get my head around all of this stuff? Where did you get all your knowledge about Illumina sequencing, sequencing artefacts etc.? For example, I'd love to know more about how things like primer dimers, chimeras, and non-specific amplification occur and how common they are. Do you know of any resources for that?

This is one of those areas that to my knowledge doesn't have any good centralized source of information. The illumina knowledge website that you linked above is pretty useful. I've just sort of pieced things together from reading library prep protocols, working in a wet lab, this forum, the cutadapt docs and other scraps around the internet. Case in point, I learned from this discussion that adapters and adapter trimming sequences are not the same thing.

1 Like

Hi @colinvwood,

So, here is a summary of my observations so far:

  • Some of my F and R reads contain Ns (i.e. unknown nucleotides). (Doesn’t matter, q2-dada2 will remove these reads automatically – at least, that's my current understanding.)
  • Some of my F reads contain the V3–V4 F primer; not right at the 5' end of the reads, but nearer the 5' end than the 3' end. (Should be removed using qiime cutadapt trim-paired.)
  • Some of my R reads contain the V3–V4 R primer; not right at the 5' end of the reads, but nearer the 5' end than the 3' end. (Should be removed using qiime cutadapt trim-paired.)
  • Some of my F reads contain the read 1 adapter trimming sequence towards the 3’ end (not entirely unexpected); and some have the read 2 adapter trimming sequence towards the 3’ end (unexpected). (Should be removed using qiime cutadapt trim-paired.)
  • Some of my R reads contain the read 2 adapter trimming sequence towards the 3’ end (not entirely unexpected); and some have the read 1 adapter trimming sequence towards the 3’ end (unexpected). (Should be removed using qiime cutadapt trim-paired.)

Adapter removal

Since the read 1 and read 2 adapters can be found towards the 3' end of some F and R reads, I assume I should use the --p-adapter-f and --p-adapter-r parameters of qiime cutadapt trim-paired to remove them both from F and R reads (removing also all downstream bases).

Adapter trimming sequences:

TruSeq™ single index (previously LT) and TruSeq CD index (previously HT)-based kits:

  • Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
  • Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Removing adapters with qiime cutadapt trim-paired:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-adapter-r AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT  \
  --verbose > adapter_trim_output.txt \
  --o-trimmed-sequences demux_adapters_trimmed.qza

V3–V4 primer removal

As for the V3–V4 primers, since they are nearer the 5' end rather than the 3' end, I was thinking of using the --p-front-f and --p-front-r parameters of qiime cutadapt trim-paired to remove them (and all upstream bases).

Something like:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux_adapters_trimmed.qza \
  --p-front-f CCTAYGGGRBGCASCAG \
  --p-front-r GGACTACNNGGGTATCTAAT  \
  --verbose > primer_trim_output.txt \
  --o-trimmed-sequences demux_adapters_and_primers_trimmed.qza

But then I read this comment:

Since the V3–V4 F primer is present in some of my F reads near the 5' end, but not exactly at the 5' end, and the V3–V4 R primer is present in some of my R reads near the 5' end, but not exactly at the 5' end, I would have thought it best to remove the primers and all upstream sequence (assuming that the sequence downstream of the primer might be biological sequence from the 16S rRNA gene). Is this wrong? What's your understanding here? You're saying keep the sequence upstream of the primers by using --p-adapter-f and --p-adapter-r to remove the primers?

Quality trimming at the 3' end

I was thinking of also doing some quality trimming at the 3' ends of the reads.

Here are my raw (binned) quality plots [NovaSeq 6000 data]:

Our data seem to have only four possible Q-scores (2, 11, 25, 37):

  • 2 = #
  • 11 = ,
  • 25 = :
  • 37 = F

Would it be good to add a --p-quality-cutoff-3end 3 parameter to the qiime cutadapt trim-paired command to trim all nucleotides with Q-score of 2 from the 3' ends of the reads? Or should I even remove the nucleotides with Q-score 11 too, by using --p-quality-cutoff-3end 12?

I guess I could instead do this at the qiime dada2 denoise-paired step, using the --p-trunc-q parameter.

My attempt at a single qiime cutadapt trim-paired command to do all of this simultaneously

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTAYGGGRBGCASCAG \
  --p-front-r GGACTACNNGGGTATCTAAT \
  --p-adapter-f AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-adapter-r AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --p-quality-cutoff-3end 12 \
  --verbose > cutadapt_output.txt \
  --o-trimmed-sequences demux_adapters_and_primers_trimmed.qza

Am I on the right track? You mentioned an --n-times parameter earlier in this thread (I suppose you mean --p-times). Do I need to include that here? And if so, why?

Dealing with short reads

I also thought about adding a --p-minimum-length parameter (maybe --p-minimum-length 180) to remove very short reads. I seem to have a small number of very short reads (shortest F read length = 24 nt, shortest R read length = 31 nt) in my data. Is it worth removing those, or will q2-dada2 deal with them?

Thanks in advance for any advice you can give, @colinvwood! I am learning a lot here. I think you have earned an acknowledgment on my thesis at this point.

Hello @KQUB,

Since the read 1 and read 2 adapters can be found towards the 3' end of some F and R reads, I assume I should use the --p-adapter-f and --p-adapter-r parameters of qiime cutadapt trim-paired to remove them both from F and R reads (removing also all downstream bases).

Be careful when passing multiple trimming sequences to cutadapt at once, see this post for an example of why. In this case it shouldn't really matter because it would be exceedingly strange to see two adapters in a single read, but I would probably still do them separately or use the --p-times option.

Since the V3–V4 F primer is present in some of my F reads near the 5' end, but not exactly at the 5' end, and the V3–V4 R primer is present in some of my R reads near the 5' end, but not exactly at the 5' end, I would have thought it best to remove the primers and all upstream sequence (assuming that the sequence downstream of the primer might be biological sequence from the 16S rRNA gene).

Yes you are right. The upstream comment I made was referring to the adapters, which you only expect to see at the 3' end. At the 5' end you remove the preceding sequence not the subsequent sequence as the cutadapt help text says.

By the way I would run the primer trimming before the adapter trimming, not the other way around. This is because you expect the primers to be nested within the adapters (see the diagrams of the library fragments above), so trimming primers will in most cases also trim the adapters.

I was thinking of also doing some quality trimming at the 3' ends of the reads.

Those sequences do not look like they need any quality filtering. Even if the quality were lower towards the end of the sequences, you could parameterize the trimming positions to dada2, not dedicate an entire extra step to only quality filtering.

This is possible, but dada2 takes quality scores into account in its algorithm, so this would mostly be personal preference.

I also thought about adding a --p-minimum-length parameter (maybe --p-minimum-length 180 ) to remove very short reads. I seem to have a small number of very short reads (shortest F read length = 24 nt, shortest R read length = 31 nt) in my data. Is it worth removing those, or will q2-dada2 deal with them?

Dada2 will deal them by discarding all reads shorter than the truncation length you provide for each read direction.

2 Likes

Hi, @colinvwood,

Good advice, it seems. I also found some other forum posts where the solution was to run two separate cutadapt commands: here, here and here. I will probably do the same, then.

I'm not sure I follow. I have been talking about removing the V3–V4 primers (a.k.a. the amplicon primers, the PCR primers) from near (but not at) the 5' ends of my reads. As far as I understand, these primers are not parts of the adapters. They should instead be part of the DNA insert shown below. The sequencing primers (or sequencing primer binding sites; i.e. Rd1 SP and Rd2 SP) are parts of the adapters, but the V3–V4 primers aren't, right? My understanding is that it therefore doesn't matter whether the adapters or V3–V4 primers are trimmed first.

image

Is that true? What makes you say so? I've read that quality plots for NovaSeq data tend to look different to, for example, quality plots for MiSeq data because of the binning of quality scores (a.k.a. Q-scores): other examples here, here, here, and here. In other words, it's my understanding that the Q-score for a given base is an average of many grouped bases. Therefore, the "actual" Q-score for a given base could be lower than expected. I'm no expert on this, though.

Whatever the case, I guess we can only work with the Q-scores we have. Here are some quotes from an Illumina PDF about NovaSeq™ 6000 System Quality Scores:

"A Q-score of 30 (Q30) corresponds to a 0.1 percent error rate in base calling, and is widely considered a benchmark for high-quality data."

"The three groups in the quality table correspond to marginal (<Q15), medium (~Q20), and high-quality (>Q30) base calls, and are assigned the specific scores of 12, 23, and 37 respectively.* Additionally, a null score of 2 is assigned to any no-calls."

The Q-scores for our data (i.e. 2, 11, 25, 37) are slightly different than those described above (i.e. 2, 12, 23, 37), but the idea seems the same. Would it not make sense to trim all bases with Q-scores < 25 from the 3' ends of our reads, so that we are left with only medium and high-quality base calls at the 3' end?

Or are my quality-filtering ideas a waste of time? Because...

I know I could use the truncation function of qiime dada2 denoise-paired to trim bases from the 3' end of reads instead of doing any kind of quality-based filtering, but it seems like it could be trickier to chose the truncation values with NovaSeq data than with, for example, MiSeq data, because the quality plots from demux.qzv don't seem to give as clear of a indication as to where would be a good position to trim. Or what do you think?

Thanks as always for your time, @colinvwood! :blush:

Hello @KQUB,

I'm not sure I follow. I have been talking about removing the V3–V4 primers (a.k.a. the amplicon primers, the PCR primers) from near (but not at) the 5' ends of my reads. As far as I understand, these primers are not parts of the adapters. They should instead be part of the DNA insert shown below. The sequencing primers (or sequencing primer binding sites; i.e. Rd1 SP and Rd2 SP) are parts of the adapters, but the V3–V4 primers aren't, right? My understanding is that it therefore doesn't matter whether the adapters or V3–V4 primers are trimmed first.

This is all correct. When a read reads into an adapter one would generally expect that to look like this:

5' [16S gene] [amplicon primer] [adapter] 3'

So trimming primers first will remove the primer and the adapter in these cases. Trimming the adapter first will leave the primer. Ultimately the order probably wouldn't make a difference but it seems smarter to trim the primers first for this reason.

Is that true? What makes you say so? I've read that quality plots for NovaSeq data tend to look different to, for example, quality plots for MiSeq data because of the binning of quality scores (a.k.a. Q-scores) [...] In other words, it's my understanding that the Q-score for a given base is an average of many grouped bases. Therefore, the "actual" Q-score for a given base could be lower than expected. I'm no expert on this, though.

That is all my understanding too, however the 25th percentile of these averages never drops below 25 which is really not bad at all. You can look at other posts on this forum and see that people typically see much worse quality drop off towards the end of reads, especially in the reverse reads.

But the quality scores themselves aren't really the point. The more information provided to dada2 the more accurately it can perform its denoising algorithm. Even when people have very poor scores towards the end of their reads, the entire reads are passed to dada2, along with the desired truncation length.

I know I could use the truncation function of qiime dada2 denoise-paired to trim bases from the 3' end of reads instead of doing any kind of quality-based filtering, but it seems like it could be trickier to chose the truncation values with NovaSeq data than with, for example, MiSeq data, because the quality plots from demux.qzv don't seem to give as clear of a indication as to where would be a good position to trim. Or what do you think?

To me, even taking into account their binned nature, the quality scores don't indicate the need to truncate. You'll have to pay attention to your read length distribution going into dada2 however, because of the primer and adapter trimming steps.

Although it's good to think through this stuff, even better is to experiment with different parameters and see what gives you the best results.

2 Likes

Hi @colinvwood,

True, but I'm talking about something else. (Apologies if I wasn't clear.)

I'm talking about this situation, which I observed in some reads:

  • 5' [something] [V3–V4 forward primer] [16S gene fragment] 3' in some forward reads
  • 5' [something] [V3–V4 reverse primer] [16S gene fragment] 3' in some reverse reads

Examples (from the first post in this discussion):

I searched the sequences downstream of the primers with BLAST and got hits to 16S rRNA genes in each case. Makes sense.

The sequences upstream of the primers, I'm less sure about. Here are the BLAST results for them. (I got the same results with megablast and blastn):

F reads:

  • 100% identity to several 16S rRNA genes
  • No significant similarity found.

R reads:

  • No significant similarity found.
  • No significant similarity found.

Not clear to me what's going on here (maybe a few different things). Either way, I plan to remove these upstream fragments by trimming the V3–V4 F primer (and everything upstream) from the forward reads, and the V3–V4 R primer (and everything upstream) from the reverse reads, using qiime cutadapt trim-paired with --p-front-f CCTAYGGGRBGCASCAG and --p-front-r GGACTACNNGGGTATCTAAT.

Okay, I'll try DADA2 with no truncation then.

To ensure there's enough overlap for merging, do you mean?

Got it! I'll try some things now and see what happens.

Thanks for your help (and your patience with all my questions)!