We recently sent some DNA samples (extracted from groundwater and surface water) to Novogene (UK) for amplicon sequencing of the V3–V4 subregions of the 16S rRNA gene.
For 16S amplicons, Novogene uses 2x250 bp paired-end sequencing on the Illumina NovaSeq 6000 system.
I have some questions about the final sequencing data, which was provided to us in 'RawData' and 'CleanData' formats. (I’ve sent Novogene an email with these questions, and have already posted on the Biostars forum as well, but no solutions yet, so thought I'd post here too.)
My understanding is that the 'RawData' folder contains the raw forward and reverse reads for each sample, and the 'CleanData' folder contains cleaned and merged reads for each sample, but the info from Novogene isn't very clear about this.
Here's what's in the 'RawData' folder for one of our samples (K2):
Question 1: If they used 2x250 bp paired-end sequencing, why are our raw reads not 250 nt long?
Prior to this, we have done all our 16S rRNA amplicon sequencing with Macrogen (Korea). They used 2x300 bp paired-end sequencing on the Illumina MiSeq platform, and we always got raw reads back that were all 301 nt long.
For 16S amplicons, Novogene uses 2x250 bp paired-end sequencing on the Illumina NovaSeq 6000 system. However, the length of the 'RawData' reads they sent us isn’t 250 nt. Most of the forward reads (e.g. in K2_1.fastq.gz) are 227 nt, and most of the reverse reads (e.g. in K2_2.fastq.gz) are 224 nt, and there are also some shorter reads mixed in (with lengths like 24 nt, 31 nt, 53 nt, 121 nt...).
I'm confused because I expected 250 nt reads, but that's not what I'm seeing, and I can’t find any information from Novogene about exactly what processing has been done on the 'RawData', if any. (I've asked them, but no reply yet.)
My guess, based on the lengths of the reads we received, is that they have already trimmed the V3–V4 primers away from the reads, as well as an extra 6 nt (see calculations below). I should also mention that I've taken a glance at the beginnings of the forward and reverse reads, and didn't find any V3–V4 primers there.
These are the primers Novogene uses for sequencing the V3–V4 subregions:
F primer: CCTAYGGGRBGCASCAG (17 nt)
R primer: GGACTACNNGGGTATCTAAT (20 nt)
Assuming they trimmed away the primer and an extra 6 nt in each case:
F reads: 250 – 17 – 6 = 227 nt
R reads: 250 – 20 – 6 = 224 nt
I guess the extra 6 nt could have been taken from the 5’ or 3’ end of the reads, or some combination of both. We don't know.
When I tried running DADA2 via (with no extra trimming or truncation) most of the forward and reverse reads did successfully merge, so that’s possibly further confirmation that the initial trimming was likely done (at least mostly) at the beginning of the reads, since there is apparently still enough overlap for merging.
Do other people agree with my reasoning here (i.e. that the "raw" reads aren't actually raw, and have actually already been trimmed to at least remove the primers)?
Or could there be some other explanation?
I wish Novogene gave a more precise description of what they've done on their end.
Question 2: Why are there V3–V4 primer sequences in the middle of some of my reads, and what should I do about that?
Even though the V3–V4 primers appear to have been trimmed away from the beginnings of the 'RawData' reads already, I still found them within some of the reads, just not right at the beginning, which is weird to me (see below). What does this mean?
For example, the F primer sequence CCTAYGGGRBGCASCAG is found in these forward reads:
@A01426:481:HFYHFDRX2:1:2103:18222:23938 1:N:0:TACGACGT+CCATGAAC GGCGACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATCTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCGTTAGGGGGGTAAAGCTCTTTTGGCCGGGAGGATGATGGCAGTGCCGGGCGGGTCTGGTCGGGGGGCGGCGGGGGCGGGGGGG
@A01426:481:HFYHFDRX2:1:2115:12843:7294 1:N:0:TACGACGT+CCATGAAC ATACCCCCGTAGTCCCGAAACAGATACCCATGTAGTCCGCTCATAGAGAGGGATGCTCTTCCGATGTCCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGAGGAAACTCTGATGCAGCGATGCCGCGTGAGTGAAGACGGCCTTTGGGTTGTAAAGCTCTTTTGTAGGGGAAGATAATGACTGTAACCTAAGAATAAGGTCCGGCTAACTTCGTGCCCGCAGC
And the R primer sequence GGACTACNNGGGTATCTAAT is found in these reverse reads:
@A01426:481:HFYHFDRX2:1:2128:10022:6543 2:N:0:TACGACGT+CCATGAAC GTTTCGGGACTACACGGGTATCTAATCCTGTTTGATCCCCACGCTTTCGTGTCTCAGCGTCAGTTACAATTTAGCAAGCTGCCTTCGCAATCGGTGTTCTGTGTGATCTCTAAGCATTTCACCGCTACACCACACATTCCGCCTACTTCAATTGTACTCAAGAATATCAGTTTCTATGGCAGTTCTACAGTTAAGCTGTAGGCTTTCACCACTGACTTAATACC
@A01426:481:HFYHFDRX2:1:2134:2871:31203 2:N:0:TACGACGT+CCATGAAC TCTGCTGCATCCAGGAGGCTGATCGAGTTGTTTAGGGACTACGAGGGTATCTAATACTGTTTGCTCCCCACGCTTTCGTGCATGAGCGTCATTGTTATCCCAGGGGGCTGCCTTCGCCAATGGTATTCCTCCACAGATCTACGCATTTCACTGCTAAACGTGGAATTCTACAACCCTATGACACACACTAGATATACAGTCACATGCGCAATACCCAGGTTAAG
Sorry for the not-very--related post, but if anyone has any insights, please do let me know. I am still very much a noob and have much to learn!