Inconsistent results with discard untrimmed reads in cutadapt

Hi,

I am processing 16S amplicon data (demultiplexed paired-end sequences) from environmental DNA samples that were amplified with the 515F and 806R primers. I ran the following command to remove primers and discard untrimmed sequences:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences ./ImportData/demux-paired-end.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --p-error-rate 0.2 \
  --p-discard-untrimmed \
  --output-dir ./TrimData_02 \
  --verbose

Even with a high tolerated error rate, I obtain a pretty poor outcome, with only 33% of pairs written, as detailed below (which might just be due to the inherently low quality of the DNA samples or to sequencing issues):

=== Summary ===

Total read pairs processed: 543
Read 1 with adapter: 187 (34.4%)
Read 2 with adapter: 196 (36.1%)

== Read fate breakdown ==
Pairs that were too short: 0 (0.0%)
Pairs discarded as untrimmed: 363 (66.9%)
Pairs written (passing filters): 180 (33.1%)

Total basepairs processed: 166,652 bp
Read 1: 80,843 bp
Read 2: 85,809 bp
Quality-trimmed: 0 bp (0.0%)
Read 1: 0 bp
Read 2: 0 bp
Total written (filtered): 87,033 bp (52.2%)
Read 1: 36,804 bp
Read 2: 50,229 bp

=== First read: Adapter 1 ===

Sequence: GTGYCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 187 times

Minimum overlap: 3
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3

Overview of removed sequences
length count expect max.err error counts
4 1 2.1 0 1
5 2 0.5 1 0 2
18 7 0.0 3 2 2 2 1
19 97 0.0 3 96 1
102 6 0.0 3 0 0 5 1
168 4 0.0 3 4
169 1 0.0 3 1
170 1 0.0 3 1
173 3 0.0 3 1 2
175 1 0.0 3 1
176 1 0.0 3 1
178 1 0.0 3 1
188 1 0.0 3 1
190 1 0.0 3 1
193 55 0.0 3 48 7
194 5 0.0 3 5

=== Second read: Adapter 2 ===

Sequence: GGACTACNVGGGTWTCTAAT; Type: regular 5'; Length: 20; Trimmed: 196 times

Minimum overlap: 3
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3

Overview of removed sequences
length count expect max.err error counts
18 1 0.0 3 1
19 90 0.0 3 76 12 0 2
20 105 0.0 3 100 0 3 2

My question is the following: when I run qiime demux summarize and visualise the characteristics of the original (demux-paired-end.qzv) and trimmed sequences, the sequence counts are nearly identical in the two groups (2,763,672 untrimmed vs 2,736,470 trimmed). How is that possible, given that about two thirds of the reads should have been removed by cutadapt?
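For reference, I generated the two summaries roughly like this (trimmed_sequences.qza is the artifact that cutadapt wrote into the output directory):

qiime demux summarize \
  --i-data ./ImportData/demux-paired-end.qza \
  --o-visualization demux-paired-end.qzv

qiime demux summarize \
  --i-data ./TrimData_02/trimmed_sequences.qza \
  --o-visualization trimmed_sequences.qzv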

Please also see the attached screenshot of the quality plots, which shows that nucleotides have been trimmed from the start of the forward and reverse reads; however, it doesn't look like untrimmed sequences were discarded from the dataset.

Untrimmed: [quality plot screenshot]

Trimmed: [quality plot screenshot]

I would be extremely grateful if you could guide me on where my interpretation or approach is wrong!

Thank you very much in advance for your help!

Kat

Hello @Kat,

Could you attach the visualization files themselves? This will allow us to look at their analysis histories to see what happened.

Hi @Kat,

Given your screenshots, it does look like the untrimmed reads were discarded. Note that the X-axis scale goes from ~300 bases down to ~280 bases, which is consistent with the 19-20 nt primers being trimmed off.

Often, when I see that only ~30-40% of the paired reads are being trimmed, it signals a few potential issues to me:

  1. The incorrect primers are being used.
  2. The sequencing approach does not actually retain the primer sequences within the reads, so much of what you are observing would be spurious removals (though I do not think that is the case here).
  3. The reads are in mixed orientation, which can occur with various sequencing methods. That is, the R1 file might contain reads matching both the forward and the reverse primer, and likewise for the R2 file; the forward primer will then only be found in the R1 reads that are oriented correctly, and vice versa. If you search the forum for "reads in mixed orientation" you'll find lots of discussion about this.
    • If this is indeed the case, fortunately we have a solution for you: qiime rescript orient-reads, which lets you use an existing SILVA, GTDB, or Greengenes reference database to reorient the reads before proceeding with cutadapt, etc. (see the sketch below).
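Here's a minimal sketch of what that could look like. Note that the parameter names below are my assumption (check qiime rescript orient-reads --help for the exact interface), and silva-138-seqs.qza is a placeholder for whichever reference sequence artifact you already have:

# Reorient mixed-orientation reads against a reference database,
# then re-run cutadapt on the oriented reads.
# NOTE: parameter names are assumed; verify with --help.
qiime rescript orient-reads \
  --i-sequences ./ImportData/demux-paired-end.qza \
  --i-reference-sequences silva-138-seqs.qza \
  --output-dir ./OrientData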

-Mike

Hi Colin,

Thank you for your answer! Please find my visualisation files attached; sorry for not sending them before.
trimmed_sequences.qzv (354.0 KB)
demux-paired-end.qzv (339.4 KB)

Hello @Kat,

Thanks for sharing the files. It doesn't look like there were any analysis mix-ups. What's causing the confusion is the cutadapt output: I believe the log you provided is only a snippet of the entire log, since cutadapt prints such a "summary" on a per-sample basis. What you attached is just the summary for the "SG2657_J8857" sample, which indeed has 543 reads before trimming (as can be seen in the demux visualization). The proportion of reads trimmed in this summary therefore does not apply to your dataset as a whole, which explains the discrepancy between it and the sequence counts in the two visualizations. (This is good news: it looks like the vast majority of your reads were trimmed.)
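If you want to double-check this yourself, one quick way (assuming you saved the verbose output to a file; cutadapt-log.txt is just a hypothetical name here) is to count the per-sample summary blocks, which should match your number of samples:

# Each per-sample summary contains exactly one of these lines.
grep -c "Total read pairs processed" cutadapt-log.txt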


Hi Colin,

Thank you very much for your very helpful reply! I now understand that I had misinterpreted the cutadapt output - I hadn't realised that the summary is provided per sample rather than for the entire dataset. It actually makes sense that the last sample (SG2657_J8857) had a poorer score, since it’s a negative control. When I look at the outputs for the other samples, most show 98–99% of pairs written, which is consistent with the sequence counts reported in the visualisation files.

Thank you again for your help!

Best wishes,
Kat


Hi Mike,

Thank you so much for taking the time to respond and for providing such helpful insights about misoriented reads!

It turns out that the confusion stemmed from my misinterpretation of the cutadapt output (as Colin explained) - the vast majority of my reads were actually trimmed and retained, but the 33% figure I was seeing applied only to the last sample. I’m very grateful that you pointed me to the rescript orient-reads function, which I’m now also testing on my dataset. It’s helpful to know that misorientation can be a common source of low read retention after trimming!

All the best,
Kat
