Thanks very much @SoilRotifer!
I realized I must be missing some critical information, because it makes no sense to me how there can be no primer sequence at the 5' end of a raw read, regardless of read-through or not.
I thought maybe the reason MiSeq data does not include primer sequence (unless there's read-through) is because the Illumina platform only records incorporation of fluorescent nucleotides in the fastq data that are created. The primers themselves are made with untagged, non-fluorescent bases, and therefore are not represented in the corresponding fastq files (unless there's read-through).
See for example the last panel labeled 'Sequencing' in this image, from this link:
In the 'Sequencing' panel, the grey and black empty boxes representing sequencing primers are not fluorescent, and so are not recorded in the 'Signal scanning' used to generate the digital files.
Maybe because I'm primarily a wet-lab biologist, I wasn't making a distinction between the sequence of literal DNA molecules in the tube of a sequencing reaction, and the digital DNA sequence file created from those molecules.
But if that were the case, why are so many primers that are not read-through removed in the DADA2 ITS tutorial?
For example, we have this quality-control output before trimming primers:
We can see that there are 3743 and 3590 examples of expected primer read-through; I understand this is typical for variable-length ITS sequences.
However, we also see 4214 and 4200 primers that are not due to primer read-through. This is also a MiSeq dataset. These are the Forward primers in the Forward read, and Reverse primers in the Reverse read, if I'm understanding correctly. They would not be fluorescent in the DNA molecules, yet they are still represented in the digital fastq files.
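To check that I'm reading that table the way I think I am, here is a minimal Python sketch of that kind of tally. The tutorial itself is in R and this is not its code. The primers and reads are made-up toy sequences, `orientations` and `count_primer_hits` are hypothetical helper names, and matching is exact only, with no handling of ambiguous bases.

```python
# Toy data only: these primers and reads are invented for illustration.
FWD = "GATTACAGG"
REV = "CCTTGACGT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orientations(primer):
    """Return the four orientations of a primer, keyed by name."""
    comp = primer.translate(_COMPLEMENT)
    return {
        "Forward": primer,
        "Complement": comp,
        "Reverse": primer[::-1],
        "RevComp": comp[::-1],
    }


def count_primer_hits(primer, reads):
    """Count reads containing the primer, separately for each orientation."""
    return {
        name: sum(seq in read for read in reads)
        for name, seq in orientations(primer).items()
    }


# Three toy forward reads: all start with the forward primer,
# and only the last one also reads through into RevComp(REV).
forward_reads = [
    FWD + "AAAAAAAA",
    FWD + "CCCCCCCC",
    FWD + "TTTT" + orientations(REV)["RevComp"],
]

print("FWD primer in forward reads:", count_primer_hits(FWD, forward_reads))
print("REV primer in forward reads:", count_primer_hits(REV, forward_reads))
```

In this toy case the "Forward" column for the forward primer counts every read, while the "RevComp" column for the reverse primer counts only the read-through read, which is how I am interpreting the two kinds of numbers in the screencap.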
I must be missing some additional piece(s) of information to explain the presence of the non-read-through primers here. Otherwise, all the q2 tutorials would be including primer sequence in downstream analysis, as the DADA2 screencap above suggests, whereas you wrote above that there are no primers in the reads for these sequencing protocols, because only read-through puts primer sequence in the data.
My questions are:
- Wouldn't all MiSeq sequencing reads have forward primers in the forward reads, and reverse primers in the reverse reads, at the 5' ends, as shown in the 4214 and 4200 columns of the DADA2 screencap above?
- If so, wouldn't these primer sequences need to be trimmed before analysis?
Hopefully the apparent contradiction I'm trying to resolve is clear:
Either the q2 tutorials are correct that no 5' end primer trimming is needed, because the primer sequence doesn't exist in the digital files (not fluorescent?), or the DADA2 ITS tutorial is correct that these 5' end primers exist in all sequencing reads regardless of read-through and should be removed. As I understand it, they can't both be correct.
Maybe the critical piece(s) of info I'm missing are related to this comment from one of your previous helpful posts:
> This is assuming the sequencing protocol used sequences through the primer, i.e. is present in the 5' region of your reads.
My understanding is that the structure of un-merged reads with NO read-through would be as follows:
Forward reads:
5' -- Forward seq-primer -- DNA insert -- some noisy N's, no R primer seq -- 3'
Reverse reads:
5' -- Reverse seq-primer -- DNA insert -- some noisy N's, no F primer seq -- 3'
Structure of un-merged reads that DO have primer read-through:
Forward reads:
5' -- Forward seq-primer -- DNA insert -- RevComp of Reverse seq-primer -- 3'
Reverse reads:
5' -- Reverse seq-primer -- DNA insert -- RevComp of Forward seq-primer -- 3'
So, wouldn't read-through be at the 3' end of reads (the RevComp), not the 5' end as in your quote above?
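The read structures above can be sketched in a few lines of Python, mostly to make my mental model explicit so it is easier to point out where it goes wrong. This assumes each read starts at the first base of the amplification primer, which is exactly the point I'm unsure about. The primers, inserts, read length, and the `simulate_pair` helper are all hypothetical toy choices, and a read that reaches the end of the amplicon is simply cut off there, whereas a real read would continue into adapter sequence.

```python
# Toy model of my assumed read structure; every sequence here is invented.
FWD = "GATTACAGG"
REV = "CCTTGACGT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_pair(fwd_primer, rev_primer, insert, read_len):
    """Return (forward_read, reverse_read) for one amplicon.

    ASSUMPTION: reads begin at the amplification primer itself.
    """
    amplicon = fwd_primer + insert + revcomp(rev_primer)
    fwd_read = amplicon[:read_len]           # read from the forward end
    rev_read = revcomp(amplicon)[:read_len]  # read from the opposite strand
    return fwd_read, rev_read


READ_LEN = 30

# Long insert: the amplicon (58 bases) is longer than the read, so no read-through.
long_fwd, long_rev = simulate_pair(FWD, REV, "A" * 40, READ_LEN)

# Short insert: the amplicon (26 bases) fits inside the read, so read-through occurs.
short_fwd, short_rev = simulate_pair(FWD, REV, "ACACACAC", READ_LEN)

for label, read in [
    ("long  F", long_fwd),
    ("long  R", long_rev),
    ("short F", short_fwd),
    ("short R", short_rev),
]:
    print(label, read)
```

Under these assumptions every read starts with its own primer at the 5' end, and only the short-insert reads end with the reverse complement of the opposite primer at the 3' end.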
Thanks again for any suggestions or info!

