Hi,
I am processing 16S amplicon data (demultiplexed paired-end sequences) from environmental DNA samples that were amplified with the 515F and 806R primers. I ran the following command to remove the primers and discard untrimmed sequences:
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences ./ImportData/demux-paired-end.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --p-error-rate 0.2 \
  --p-discard-untrimmed \
  --output-dir ./TrimData_02 \
  --verbose
Even with a high tolerated error rate, I obtain a fairly poor outcome: only 33% of read pairs are written, as detailed below (which might simply reflect the inherently low quality of the DNA samples or sequencing issues):
=== Summary ===
Total read pairs processed: 543
Read 1 with adapter: 187 (34.4%)
Read 2 with adapter: 196 (36.1%)

== Read fate breakdown ==
Pairs that were too short: 0 (0.0%)
Pairs discarded as untrimmed: 363 (66.9%)
Pairs written (passing filters): 180 (33.1%)

Total basepairs processed: 166,652 bp
Read 1: 80,843 bp
Read 2: 85,809 bp
Quality-trimmed: 0 bp (0.0%)
Read 1: 0 bp
Read 2: 0 bp
Total written (filtered): 87,033 bp (52.2%)
Read 1: 36,804 bp
Read 2: 50,229 bp

=== First read: Adapter 1 ===
Sequence: GTGYCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 187 times
Minimum overlap: 3
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3

Overview of removed sequences
length count expect max.err error counts
4 1 2.1 0 1
5 2 0.5 1 0 2
18 7 0.0 3 2 2 2 1
19 97 0.0 3 96 1
102 6 0.0 3 0 0 5 1
168 4 0.0 3 4
169 1 0.0 3 1
170 1 0.0 3 1
173 3 0.0 3 1 2
175 1 0.0 3 1
176 1 0.0 3 1
178 1 0.0 3 1
188 1 0.0 3 1
190 1 0.0 3 1
193 55 0.0 3 48 7
194 5 0.0 3 5

=== Second read: Adapter 2 ===
Sequence: GGACTACNVGGGTWTCTAAT; Type: regular 5'; Length: 20; Trimmed: 196 times
Minimum overlap: 3
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3

Overview of removed sequences
length count expect max.err error counts
18 1 0.0 3 1
19 90 0.0 3 76 12 0 2
20 105 0.0 3 100 0 3 2
My question is the following: when I run "qiime demux summarize" and visualise the characteristics of the original (demux-paired-end.qzv) and the trimmed sequences, the sequence counts are nearly identical between the two (2,763,672 untrimmed vs 2,736,470 trimmed). How is that possible, given that roughly two thirds of the read pairs should have been removed by cutadapt?
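For context, this is roughly how I generated the two visualisations; the file name trimmed_sequences.qza is my assumption based on the default output naming when --output-dir is used with trim-paired:

qiime demux summarize \
  --i-data ./ImportData/demux-paired-end.qza \
  --o-visualization ./ImportData/demux-paired-end.qzv

qiime demux summarize \
  --i-data ./TrimData_02/trimmed_sequences.qza \
  --o-visualization ./TrimData_02/trimmed_sequences.qzv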
Please also see the attached screenshots of the quality plots, which show that nucleotides have been trimmed from the start of the forward and reverse reads; however, it does not look as though the untrimmed sequences were actually discarded from the dataset.
Untrimmed:
Trimmed:
I would be extremely grateful if you could guide me on where my interpretation or approach is wrong!
Thank you very much in advance for your help!
Kat