Cutadapt - discard untrimmed

arwqiime · February 22, 2023, 11:13am

Hello,
I have processed 16S Illumina single-end reads (292 bp length) with cutadapt to remove 515F/806R primers.
SInce I included the '--p-discard-untrimmed' option, I assumed that the output reads would be either trimmed, or removed from the data set.
An inspection of a MAFFT alignment on a subset og 1000 reads, however, showed that a considerable number of surviving reads still do contain adapter sequences and were not remove from the data set. (see screenshot)

How can I force cutadapt to remove these untrimmed reads?
Best regards,

I used this q2 command:
qiime cutadapt trim-single
--i-demultiplexed-sequences 1_Z02-Exp011-orig.qza
--p-cores 40
--p-times 3
--p-minimum-length 150
--o-trimmed-sequences 1_Z02-Exp011-orig-cutadapt.qza
--p-error-rate 0.2
--p-adapter ATTAGAWACCCBNGTAGTCC
--p-front GTGYCAGCMGCCGCGGTAA
--p-discard-untrimmed \

colinbrislawn · February 27, 2023, 1:51pm

I appreciate your patience.

When I look at the primers you circled on the right, I notice they are not conserved, i.e. they are different from each other. Because these are all different, I wonder if this is why they were not removed by cutadapt. Are you sure these are primers, because I would assume adapters would be much more conserved...

I also noticed that there is ever only an extra sequence at either the 5' end or the 3' end and not both. These might be from the host or other spurious sequences that were amplified. Perhaps you can BLAST a few of these and look at the query coverage, and see if those extra bits of sequence align to anything?

Let's investigate further and see what we can find!

arwqiime · February 28, 2023, 12:01pm

Hi @colinbrislawn
Thank you very much for your comments. I have looked in more detail on those reads that are not well conserved at the 3' end, but they do not align to known reference sequences. The 3' end has also lower phred scores compared to the 5' end, and I assume now, that this is indicating that the Illumina machine produces more errors there.
However, when running cutadapt consecutively with (1) searching for the tail (--p-adapter), and (2) then searching for the front (--p-front), most of the reads were trimmed correctly. In detail, 2,900 reads out of 30,000 reads were left with too long ends in the single cutadapt run (targeting front and tail adapter in a single run), while only 29 reads out of 30,000 were longer than expected after the two consecutive runs.

I have looked into the phred score issue (mentioned above) in more detail. These sample were part of an experiment where I compared classical amplicon primers for 16S-V4 with phased amplicon primers (with one or two additional nucleotide in between the 515F refion and the 5' Illumina extension). These phased reads (using the same DNA preparations) show significantly better phred scores also at towards the 3' ends (I used 293 cycles to generate the R1 reads). The phased reads with higher phred score perform much better in cutadapt compared to a non-phased primer, but the efficieny of primer removal is again lower in a single cutadapt run compared to consecutive runs.

The number together:
non-phased reads: single cutadapt run (2900 of 30551 (9.4%) reads too long)
non-phased reads: consecutive cutadapt runs (29 of 27,680 (0.1%) reads too long; the remaining were discarded)
phased reads: single cutadapt run (1240 of 123,263 (1%) reads too long)
phased reads: consecutive cutadapt runs (37 of 122,110 (0.03% reads too long; the remaining were discarded)

Taken together:
The read quality is of primary importance (ok, this is not new), and phased primer can considerably improve this.
The low number of too-long reads are not critical, because they will probably not survive DADA2 (at least 2 reads must be present to form a ASV).
I still do not know why cutadapt results are different between a single and two consecutive runs.

I use v. q2-2021.11, and I realized that cutadapt in q2-2022.11 as more parameters to choose from. I did not compare these two version yet.

Best regards,

colinbrislawn · February 28, 2023, 2:22pm

This is fascinating! I did not know cutadapt would perform differently when run twice on the same data set.

I think this part is new, and worth exploring more. I consider the quality of the Illumina sequencing as a random, so primer design to maximize quality is interesting.

Have you confirmed that the quality is actually higher, or is Illumina just reporting higher q-scores? (Q scores are an estimate of quality, and this estimate could be wrong.)

I would be interested to see how this newest version works.

Thank you for sharing all you have discovered. Please report back once you learn more!

arwqiime · February 28, 2023, 2:54pm

Hi @colinbrislawn

This has been reported quite a while ago for MiSeq, and it is also true in my lab for the iSeq-100:

I am quite sure that the quality is higher. I can see it at the error rates in Illumina's Sequence Analysis Viewer, but also along individual reads (although iSeq-100 applies Illumina's Q-score binning).

colinbrislawn · February 28, 2023, 4:18pm

Ah, I think I was thinking of something slightly different. See Figure 4 from this paper:

Note how in this example, the instrument-derived q-scores are higher (~38) than the Empirical q-scores (~25) calculated by comparing sequences against the known reference. In this example, Illumina is overestimating the q-scores, but you would only discover that by ignoring their estimates and calculating the q-score yourself.

arwqiime · March 1, 2023, 7:50am

@colinbrislawn
Ok, I see what you mean. Could you add the doi to that paper, it got lost in the link.
Best regards,

colinbrislawn · March 2, 2023, 1:12am

My mistake!

I've edited the link so it uses PubMed

arwqiime · March 2, 2023, 3:18pm

Hi @colinbrislawn

I have testet it with cutadapt of q2-2022.11, using the same parameters as used in q2-2021.11 and leaving the additional paramateras at default/optional.
No surprise, the results are identical to the v2021.11 version: Trimming primers at both ends in one step leaves ca. 1240 reads in the output, while consecutive trimming only showes 37 'too long' reads.

I assume that cutadapt cannot distinguish between one or both primers detected in the reads in the '1-step-run', and keeps the reads even if only on primer has been detected.
I think an additional parameter would be helpful when using --p-adapter AND --p-front together (something like --p-discard-single-trimmed or --p-keep-all-trimmed).
But the 'consecutive mode' does the same.

Best regards ,

colinbrislawn · March 2, 2023, 5:15pm

What if we tried a different approach:

What if we filter for the region of interest, instead of against the primers.

Try this:

EDIT: Looks like extract-seq-seqments will only work on FASTA files, not FASTQ. So that will not work well for your idea, though the idea is good.

arwqiime · March 3, 2023, 3:08pm

Good ideas are always great!

I would like to transfer our thought to a different scenario: We are analysing also fungal communities by the amplification of the ITS1 region (primers ITS1F and ITS2, amplyfying the ITS1 region; sorry for possible confusions of regions and primers, but the nomenclature of this has historical reasons).
ITS (1 and/or 2) regions do exibit considerable sequence length polymorphism compared to e.g. 16S-V4 regions. Therefore, the right PCR primer (ITS2) is not always present even in 300 bp R1 reads, and therefore the --p-discard-untrimmed option should not be used with --p-adapter.
On the other hand, the left primer (ITS1F) should be present in all reads, and the --p-discard-untrimmed option should be used with --p-front.

As far as I can see, this can be achieved only with consectutive cutadapt runs.
Well, great!
Best regards,

colinbrislawn · March 5, 2023, 3:08pm

That makes sense to me, based on your explanation. You mentioned you tried running it like this, and it seems to work. Is this causing any other issues?

arwqiime · March 6, 2023, 11:07am

I did not experience any technical issues with these data in downstream analyses.

However, one should be careful when going to merge ITS sequence data of different Illumina runs. They should be obtained using the same sequencing length (e.g. 300 bp) in order to merge dada2 representative sequences. As far as I know, sequences of a 150 bp and a 300 bp mode will receive different hash feature ids even if they are in part identical.

This is less problematic for 16S V4 regions since a 300 bp single-end run will cover the entire V4 region, and if they are trimmed properly, data from different sequencing runs can be easily merged.

But this is more a question of the underlying design of experiments which should be addressed way before you start with the project.

system · April 6, 2023, 5:08pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.