I have processed 16S Illumina single-end reads (292 bp length) with cutadapt to remove 515F/806R primers.
Since I included the '--p-discard-untrimmed' option, I assumed that the output reads would either be trimmed or removed from the data set.
An inspection of a MAFFT alignment of a subset of 1,000 reads, however, showed that a considerable number of surviving reads still contain adapter sequences and were not removed from the data set (see screenshot).
When I look at the primers you circled on the right, I notice they are not conserved, i.e. they differ from each other. Because these are all different, I wonder if this is why they were not removed by cutadapt. Are you sure these are primers? I would assume adapters would be much more conserved...
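One explanation for that variation could simply be the degenerate IUPAC bases in the 515F/806R primers. A minimal Python sketch (assuming the common 515F sequence GTGYCAGCMGCCGCGGTAA; the two read prefixes below are made up for illustration) expands the IUPAC codes into a regex and shows that two visibly different sequences can both be valid copies of the same primer:

```python
import re

# IUPAC ambiguity codes -> regex character classes (subset used here)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
         "S": "[CG]", "W": "[AT]", "V": "[ACG]", "N": "[ACGT]"}

def primer_to_regex(primer):
    """Expand a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer)

p515f = primer_to_regex("GTGYCAGCMGCCGCGGTAA")  # assumed 515F sequence

# Two hypothetical read prefixes differing at the degenerate positions
read1 = "GTGCCAGCAGCCGCGGTAA"  # Y->C, M->A
read2 = "GTGTCAGCCGCCGCGGTAA"  # Y->T, M->C

print(bool(re.match(p515f, read1)))  # True: matches the primer pattern
print(bool(re.match(p515f, read2)))  # True, despite looking "unconserved"
```

So the circled sequences being different from each other does not by itself rule out that they are primer copies; cutadapt handles IUPAC wildcards in the primer by default.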
I also noticed that there is only ever an extra sequence at either the 5' end or the 3' end, not both. These might be from the host or other spurious sequences that were amplified. Perhaps you can BLAST a few of these, look at the query coverage, and see if those extra bits of sequence align to anything?
Let's investigate further and see what we can find!
Thank you very much for your comments. I have looked in more detail at those reads that are not well conserved at the 3' end, but they do not align to known reference sequences. The 3' end also has lower phred scores compared to the 5' end, and I now assume that this indicates that the Illumina machine produces more errors there.
However, when running cutadapt consecutively with (1) searching for the tail (--p-adapter) and (2) then searching for the front (--p-front), most of the reads were trimmed correctly. In detail, 2,900 out of 30,000 reads were left with overly long ends in the single cutadapt run (targeting front and tail adapter in one run), while only 29 out of 30,000 reads were longer than expected after the two consecutive runs.
I have looked into the phred score issue (mentioned above) in more detail. These samples were part of an experiment where I compared classical amplicon primers for 16S-V4 with phased amplicon primers (with one or two additional nucleotides between the 515F region and the 5' Illumina extension). These phased reads (using the same DNA preparations) show significantly better phred scores, also towards the 3' ends (I used 293 cycles to generate the R1 reads). The phased reads with higher phred scores perform much better in cutadapt compared to non-phased primers, but the efficiency of primer removal is again lower in a single cutadapt run compared to consecutive runs.
The numbers together:
non-phased reads: single cutadapt run (2,900 of 30,551 (9.4%) reads too long)
non-phased reads: consecutive cutadapt runs (29 of 27,680 (0.1%) reads too long; the remaining were discarded)
phased reads: single cutadapt run (1,240 of 123,263 (1.0%) reads too long)
phased reads: consecutive cutadapt runs (37 of 122,110 (0.03%) reads too long; the remaining were discarded)
The read quality is of primary importance (ok, this is not new), and phased primers can considerably improve it.
The low number of too-long reads is not critical, because they will probably not survive DADA2 (at least 2 reads must be present to form an ASV).
I still do not know why cutadapt results are different between a single and two consecutive runs.
I use q2-2021.11, and I realized that cutadapt in q2-2022.11 has more parameters to choose from. I have not compared these two versions yet.
This has been reported quite a while ago for MiSeq, and it is also true in my lab for the iSeq-100:
I am quite sure that the quality is higher. I can see it in the error rates in Illumina's Sequence Analysis Viewer, but also along individual reads (although the iSeq-100 applies Illumina's Q-score binning).
Note how in this example the instrument-derived Q-scores are higher (~38) than the empirical Q-scores (~25) calculated by comparing sequences against the known reference. In this example, Illumina is overestimating the Q-scores, but you would only discover that by ignoring their estimates and calculating the Q-scores yourself.
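For reference, a Phred score encodes an error probability via Q = -10·log10(p), so the gap between Q38 and Q25 is larger than it looks. A small sketch (the Q-values are taken from the example above):

```python
import math

def q_to_p(q):
    """Error probability encoded by a Phred quality score."""
    return 10 ** (-q / 10)

def p_to_q(p):
    """Phred quality score corresponding to an error probability."""
    return -10 * math.log10(p)

p_claimed = q_to_p(38)   # instrument estimate: ~1.6e-4 errors per base
p_observed = q_to_p(25)  # empirical estimate:  ~3.2e-3 errors per base

# The empirical error rate is roughly 20x higher than the instrument claims
print(round(p_observed / p_claimed, 1))
```

In other words, a 13-point overestimate in Q corresponds to about a twenty-fold underestimate of the per-base error rate.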
I have tested it with cutadapt of q2-2022.11, using the same parameters as in q2-2021.11 and leaving the additional parameters at their defaults.
No surprise, the results are identical to the q2-2021.11 version: trimming primers at both ends in one step leaves ca. 1,240 'too long' reads in the output, while consecutive trimming shows only 37.
I assume that cutadapt cannot distinguish between one or both primers detected in the reads in the '1-step run', and keeps a read even if only one primer has been detected.
I think an additional parameter would be helpful when using --p-adapter AND --p-front together (something like --p-discard-single-trimmed or --p-keep-all-trimmed).
But the 'consecutive mode' already achieves the same effect.
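The suspected --p-discard-untrimmed semantics can be illustrated with a toy model (this is not cutadapt itself; primers and reads below are made-up stand-ins). In the single run, a read counts as 'trimmed' as soon as either primer is found, so a read carrying only the 5' primer survives with its overlong 3' end; the consecutive runs require each primer in turn:

```python
def trim_single_run(read, front, tail):
    """One call with both primers; discard only if NEITHER primer is found."""
    found = False
    if read.startswith(front):
        read = read[len(front):]
        found = True
    i = read.find(tail)
    if i != -1:
        read = read[:i]
        found = True
    return read if found else None  # discard-untrimmed: any match counts

def trim_consecutive(read, front, tail):
    """Two calls, each requiring its own primer (discard-untrimmed twice)."""
    i = read.find(tail)             # run 1: 3' primer required
    if i == -1:
        return None
    read = read[:i]
    if not read.startswith(front):  # run 2: 5' primer required
        return None
    return read[len(front):]

FRONT, TAIL = "AAAA", "TTTT"   # stand-ins for the 5' and 3' primers
both = "AAAACCCCTTTT"          # both primers present in the read
front_only = "AAAACCCCGGGG"    # 3' primer ran off the end of the read

print(trim_single_run(both, FRONT, TAIL))        # 'CCCC'
print(trim_single_run(front_only, FRONT, TAIL))  # 'CCCCGGGG' <- kept, too long
print(trim_consecutive(front_only, FRONT, TAIL)) # None <- discarded
```

Under this model, the ~1,240 'too long' reads from the single run would simply be the reads where only one of the two primers was detected.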
I would like to transfer our thoughts to a different scenario: we are also analysing fungal communities by amplification of the ITS1 region (primers ITS1F and ITS2, amplifying the ITS1 region; sorry for possible confusion of regions and primers, but this nomenclature has historical reasons).
ITS (1 and/or 2) regions exhibit considerable sequence length polymorphism compared to e.g. 16S-V4 regions. Therefore, the right PCR primer (ITS2) is not always present even in 300 bp R1 reads, and the --p-discard-untrimmed option should not be used with --p-adapter.
On the other hand, the left primer (ITS1F) should be present in all reads, and the --p-discard-untrimmed option should be used with --p-front.
As far as I can see, this can be achieved only with consecutive cutadapt runs.
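The asymmetric ITS strategy can be sketched the same way (again a toy model, not cutadapt; primers and reads are placeholders): run 1 requires the 5' primer (--p-front with --p-discard-untrimmed), run 2 trims the 3' primer only where it occurs (--p-adapter without discarding):

```python
def trim_its(read, front, tail):
    """Run 1: 5' primer required; run 2: 3' primer optional
    (it may lie beyond the read end due to ITS length polymorphism)."""
    if not read.startswith(front):  # --p-front + --p-discard-untrimmed
        return None
    read = read[len(front):]
    i = read.find(tail)             # --p-adapter WITHOUT --p-discard-untrimmed
    if i != -1:
        read = read[:i]
    return read

FRONT, TAIL = "AAAA", "TTTT"   # stand-ins for ITS1F / reverse complement of ITS2
short_its = "AAAACCTTTTGG"     # short ITS1: the read reaches into the 3' primer
long_its = "AAAACCCCCCCCCC"    # long ITS1: 3' primer lies beyond the read end
no_front = "GGGGCCTTTT"        # 5' primer missing -> spurious read

print(trim_its(short_its, FRONT, TAIL))  # 'CC' (both ends trimmed)
print(trim_its(long_its, FRONT, TAIL))   # 'CCCCCCCCCC' (kept, 3' untrimmed)
print(trim_its(no_front, FRONT, TAIL))   # None (discarded)
```

This way, length-polymorphic reads are retained while reads lacking the mandatory 5' primer are still removed.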
I did not experience any technical issues with these data in downstream analyses.
However, one should be careful when merging ITS sequence data from different Illumina runs. They should be obtained with the same sequencing length (e.g. 300 bp) in order to merge dada2 representative sequences. As far as I know, sequences from a 150 bp and a 300 bp run will receive different hash feature IDs even if they are partially identical.
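Assuming the feature IDs are MD5 digests of the exact sequence string (which is how QIIME 2 derives hashed feature IDs), this is easy to see: a 150 bp prefix of a 300 bp sequence hashes to a completely different ID even though the sequences are partially identical. The sequence below is made up:

```python
import hashlib

seq_300 = "ACGT" * 75    # hypothetical 300 bp representative sequence
seq_150 = seq_300[:150]  # the same amplicon sequenced in 150 bp mode

id_300 = hashlib.md5(seq_300.encode()).hexdigest()
id_150 = hashlib.md5(seq_150.encode()).hexdigest()

print(id_300 == id_150)  # False: different feature IDs despite shared prefix
```

So two runs of different read lengths would treat the same organism as two distinct features unless the sequences are trimmed to a common length first.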
This is less problematic for 16S V4 regions since a 300 bp single-end run will cover the entire V4 region, and if they are trimmed properly, data from different sequencing runs can be easily merged.
But this is more a question of the underlying experimental design, which should be addressed well before you start the project.