Greetings qiime2 forum once again! I am in the process of testing workflows for processing data from 16S amplicons of different sizes, the result of amplification with different primer sets (V4, a general prokaryotic set, and a more specific set). All overlap in the region 515F to 805R.
V4 - 515F/806R - yields a 291 bp amplicon
Pro - 341F/805R - yields a 464 bp amplicon
Arc - 515F/915R - yields a 400 bp amplicon
In searching the forum for workflows to process all of these datasets, three workflows were suggested per the following links, along with some caveats to keep in mind via @Nicholas_Bokulich.
I've done two workflows with q2-fragment-insertion, each with different preprocessing beforehand, and now I'm working on a workflow that trims off the primers, denoises/cleans each dataset separately with dada2, and then merges for downstream q2 analysis.
I'd like to verify that I am using the cutadapt plugin correctly, as the flags for the plugin are a bit different from the cutadapt docs. My data is already demultiplexed, so I'm basically looking to cut off the ends by primer to get all reads, regardless of dataset, down to 291 bp (the size of the smallest overlapping amplicon region).
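For my own sanity, here is how I'm reading the mapping between the q2-cutadapt flags and the underlying cutadapt options (you can see this in the verbose command printed in the output I paste below, so treat this as my interpretation rather than gospel):
# qiime cutadapt trim-paired flag  ->  underlying cutadapt option (per the verbose command below)
# --p-front-f SEQUENCE             ->  --front SEQUENCE (i.e. -g; 5' primer on the forward reads)
# --p-front-r SEQUENCE             ->  -G SEQUENCE (5' primer on the reverse reads)
# --p-error-rate                   ->  --error-rate (left at the default 0.1 here)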
My Pro primers amplify a region from 341F to 805R. My V4 region amplifies starting at 515, not 341, so I basically need to cut off ~174 bp. Per @Nicholas_Bokulich's suggestion, I wasn't going to do a --p-trim-left-f of 174 because of the differences in amplicon length; instead I was going to trim based on the V4 primer sequence, which starts at 515. Of course, this assumes that my Pro dataset contains a sequence around position 515 that matches my V4 primer. I think this is a fairly safe assumption since it's all 16S and all bacteria - all the same samples from the same exact experiment, just sequenced with different primers.
I uploaded the data into demux.qza artifacts fine. I viewed the original Pro-pe-demux.qzv, not yet trimmed with cutadapt (see attached: Pro-pe-demux.qzv (287.4 KB)). Typically at this point, if I were processing a single .qza, I'd run dada2 after looking at the histograms and trim or trunc based on where quality starts to tank. But instead I'm cutting primers off first...
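(For context, if I were processing the Pro dataset on its own, the dada2 step I'd normally run at this point would be roughly the sketch below. The truncation lengths are placeholders I'd pick from the interactive quality plots, not actual choices, and I'm using --output-dir so I don't have to spell out every output name.)
# rough sketch only - trunc values are placeholders, pick them from the demux.qzv quality plots
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs Pro-pe-demux.qza \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 220 \
  --output-dir Pro-dada2-output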
My run of it: I used the V4 forward primer, which starts at 515F (GTGCCAGCMGCCGCGGTAA), to trim the Pro dataset that starts at 341, along with the normal Pro reverse primer 805R (GACTACNVGGGTATCTAATCC). (FYI, I'm pasting the entire output of the run since I ran it with --verbose.) My goal, again, is to reduce these sequences to the 291 bp region so I can eventually combine them with the V4 dataset, which is already at 291 bp after denoising. I will eventually have to repeat this process for the Arc dataset, but before I launch into that I wanted to make sure what I have now makes sense...
(qiime2-2018.6) wsb255bioimac27:Issue_0033_cutadapt_qiime2 mel_local$ qiime cutadapt trim-paired --p-front-f GTGCCAGCMGCCGCGGTAA --p-front-r GACTACNVGGGTATCTAATCC --i-demultiplexed-sequences Pro-pe-demux.qza --o-trimmed-sequences Pro-pe-demux-primerTrim.qza --verbose
Running external command line application. This may print messages to stdout and/or stderr.
The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: cutadapt --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_0_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_1_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_0_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_1_L001_R2_001.fastq.gz
This is cutadapt 1.16 with Python 3.5.5
Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_0_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_1_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_0_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_1_L001_R2_001.fastq.gz
Running on 1 core
Trimming 2 adapters with at most 10.0% errors in paired-end mode ...
Finished in 4.45 s (42 us/read; 1.44 M reads/minute).
=== Summary ===
Total read pairs processed: 106,663
Read 1 with adapter: 102,365 (96.0%)
Read 2 with adapter: 105,574 (99.0%)
Pairs written (passing filters): 106,663 (100.0%)
Total basepairs processed: 64,211,126 bp
Read 1: 32,105,563 bp
Read 2: 32,105,563 bp
Total written (filtered): 43,295,188 bp (67.4%)
Read 1: 13,407,567 bp
Read 2: 29,887,621 bp
=== First read: Adapter 1 ===
Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 102365 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1
Overview of removed sequences
length count expect max.err error counts
18 33 0.0 1 15 18
19 88 0.0 1 87 1
24 4 0.0 1 2 2
26 15 0.0 1 0 15
33 1 0.0 1 1
59 1 0.0 1 1
68 1 0.0 1 1
70 2 0.0 1 2
79 2 0.0 1 2
86 1 0.0 1 1
91 4 0.0 1 4
93 19 0.0 1 17 2
99 2 0.0 1 2
105 1 0.0 1 1
108 2 0.0 1 2
109 1 0.0 1 1
110 150 0.0 1 137 13
111 3 0.0 1 3
116 4 0.0 1 4
121 4 0.0 1 4
128 9 0.0 1 9
129 4 0.0 1 4
130 2 0.0 1 2
131 6 0.0 1 6
133 1 0.0 1 1
136 1 0.0 1 1
145 1 0.0 1 1
148 15 0.0 1 15
149 3 0.0 1 3
150 26 0.0 1 25 1
151 1313 0.0 1 1287 26
152 6 0.0 1 6
153 1 0.0 1 1
154 1 0.0 1 0 1
159 6 0.0 1 6
160 2 0.0 1 2
165 53 0.0 1 52 1
166 22 0.0 1 19 3
167 168 0.0 1 137 31
168 9188 0.0 1 8636 552
169 3227 0.0 1 3112 115
170 6542 0.0 1 6292 250
171 10107 0.0 1 9703 404
172 402 0.0 1 379 23
173 1203 0.0 1 1146 57
174 37 0.0 1 37
175 229 0.0 1 205 24
176 2337 0.0 1 2260 77
177 139 0.0 1 132 7
178 188 0.0 1 181 7
179 2 0.0 1 2
181 5 0.0 1 5
182 10 0.0 1 10
183 4 0.0 1 4
184 104 0.0 1 74 30
185 1901 0.0 1 1834 67
186 184 0.0 1 160 24
187 7359 0.0 1 7071 288
188 23632 0.0 1 23004 628
189 2585 0.0 1 2509 76
190 921 0.0 1 883 38
191 41 0.0 1 35 6
192 503 0.0 1 433 70
193 25388 0.0 1 24720 668
194 4094 0.0 1 3768 326
195 52 0.0 1 47 5
208 1 0.0 1 1
239 2 0.0 1 2
=== Second read: Adapter 2 ===
Sequence: GACTACNVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 105574 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2
Overview of removed sequences
length count expect max.err error counts
3 14 1666.6 0 14
12 1 0.0 1 1
16 1 0.0 1 0 1
17 2 0.0 1 2
18 4 0.0 1 2 2
19 49 0.0 1 3 4 42
20 845 0.0 2 64 665 116
21 104283 0.0 2 101870 1973 440
22 345 0.0 2 62 246 37
23 9 0.0 2 1 0 8
24 2 0.0 2 2
25 2 0.0 2 2
30 1 0.0 2 1
55 1 0.0 2 0 0 1
56 1 0.0 2 1
62 1 0.0 2 0 1
63 1 0.0 2 0 0 1
150 2 0.0 2 2
151 3 0.0 2 1 1 1
154 7 0.0 2 4 1 2
Command: cutadapt --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_2_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_3_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_2_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_3_L001_R2_001.fastq.gz
This is cutadapt 1.16 with Python 3.5.5
Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_2_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_3_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_2_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_3_L001_R2_001.fastq.gz
Running on 1 core
Trimming 2 adapters with at most 10.0% errors in paired-end mode ...
Finished in 4.61 s (41 us/read; 1.45 M reads/minute).
=== Summary ===
Total read pairs processed: 111,659
Read 1 with adapter: 107,578 (96.3%)
Read 2 with adapter: 110,724 (99.2%)
Pairs written (passing filters): 111,659 (100.0%)
Total basepairs processed: 67,218,718 bp
Read 1: 33,609,359 bp
Read 2: 33,609,359 bp
Total written (filtered): 45,266,596 bp (67.3%)
Read 1: 13,984,409 bp
Read 2: 31,282,187 bp
=== First read: Adapter 1 ===
Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 107578 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1
Overview of removed sequences
length count expect max.err error counts
18 49 0.0 1 16 33
19 85 0.0 1 84 1
24 1 0.0 1 0 1
26 12 0.0 1 0 12
59 1 0.0 1 1
70 1 0.0 1 1
71 1 0.0 1 1
75 1 0.0 1 1
79 2 0.0 1 2
83 1 0.0 1 1
86 1 0.0 1 1
91 6 0.0 1 6
92 1 0.0 1 1
93 65 0.0 1 64 1
95 2 0.0 1 0 2
96 1 0.0 1 1
99 1 0.0 1 1
105 1 0.0 1 1
108 1 0.0 1 0 1
109 1 0.0 1 1
110 146 0.0 1 137 9
111 4 0.0 1 4
115 1 0.0 1 1
116 5 0.0 1 4 1
121 8 0.0 1 7 1
125 1 0.0 1 1
128 15 0.0 1 15
129 3 0.0 1 3
130 3 0.0 1 3
131 1 0.0 1 1
133 2 0.0 1 2
134 1 0.0 1 1
135 1 0.0 1 1
139 1 0.0 1 1
141 1 0.0 1 1
144 1 0.0 1 1
145 3 0.0 1 3
147 2 0.0 1 2
148 5 0.0 1 5
149 1 0.0 1 1
150 22 0.0 1 19 3
151 1064 0.0 1 1040 24
152 5 0.0 1 5
153 2 0.0 1 0 2
154 2 0.0 1 0 2
158 1 0.0 1 1
159 10 0.0 1 10
164 1 0.0 1 1
165 63 0.0 1 58 5
166 33 0.0 1 33
167 188 0.0 1 156 32
168 10213 0.0 1 9607 606
169 3417 0.0 1 3307 110
170 7962 0.0 1 7641 321
171 10517 0.0 1 10155 362
172 1051 0.0 1 1003 48
173 1191 0.0 1 1143 48
174 78 0.0 1 74 4
175 352 0.0 1 326 26
176 2073 0.0 1 2015 58
177 106 0.0 1 101 5
178 208 0.0 1 200 8
179 2 0.0 1 2
180 1 0.0 1 1
181 4 0.0 1 4
182 9 0.0 1 9
183 9 0.0 1 7 2
184 90 0.0 1 63 27
185 1705 0.0 1 1632 73
186 168 0.0 1 156 12
187 7418 0.0 1 7085 333
188 23584 0.0 1 22965 619
189 2655 0.0 1 2561 94
190 918 0.0 1 883 35
191 46 0.0 1 46
192 471 0.0 1 391 80
193 27326 0.0 1 26619 707
194 4109 0.0 1 3745 364
195 55 0.0 1 52 3
196 3 0.0 1 3
202 1 0.0 1 1
206 1 0.0 1 1
214 1 0.0 1 1
222 2 0.0 1 2
234 1 0.0 1 1
236 1 0.0 1 1
=== Second read: Adapter 2 ===
Sequence: GACTACNVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 110724 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2
Overview of removed sequences
length count expect max.err error counts
3 7 1744.7 0 7
6 1 27.3 0 1
12 1 0.0 1 0 1
16 1 0.0 1 0 1
18 2 0.0 1 0 2
19 41 0.0 1 0 2 39
20 948 0.0 2 61 760 127
21 109345 0.0 2 106849 2017 479
22 339 0.0 2 64 221 54
23 7 0.0 2 1 0 6
24 5 0.0 2 2 1 2
25 1 0.0 2 1
28 1 0.0 2 1
49 1 0.0 2 0 1
60 1 0.0 2 0 0 1
61 3 0.0 2 0 0 3
147 8 0.0 2 6 1 1
150 1 0.0 2 0 0 1
151 2 0.0 2 1 1
153 1 0.0 2 0 1
154 8 0.0 2 5 3
(... output continues ...)
Saved SampleData[PairedEndSequencesWithQuality] to: Pro-pe-demux-primerTrim.qza
(qiime2-2018.6) wsb255bioimac27:Issue_0033_cutadapt_qiime2 mel_local$ qiime demux summarize --i-data Pro-pe-demux-primerTrim.qza --o-visualization Pro-pe-demux-primerTrim.qzv
Saved Visualization to: Pro-pe-demux-primerTrim.qzv
New trimmed demux attached: Pro-pe-demux-primerTrim.qzv (292.6 KB)
So then I viewed the results, and it's difficult for me to figure out whether it worked. From the summary at the beginning of the verbose output, I think it did? Then, opening the .qzv, the first thing you notice is that the quality on the forward read set sucks - but that may make sense, since we cut off the first ~174 bp, and if you look at the original Pro demux, quality starts to falter around 174 bp in...
Is there a way to check that I am indeed looking at around 291 bp after the trimming versus before, aside from exporting the file back to fastq/fasta, opening it in VIM, and eyeballing the average sequence length? In the demux .qzv's, both max out at 301. I suppose I'm confused about how to verify that cutadapt did what it was supposed to, such that I now have sequence fragments of around 291 bp.
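(To be concrete, the manual check I had in mind is something along these lines - the export syntax differs a bit between QIIME 2 releases, and the filename is just one of the per-sample files from the cutadapt log above:)
# dump the trimmed reads back to fastq (newer releases use --input-path/--output-path instead)
qiime tools export Pro-pe-demux-primerTrim.qza --output-dir Pro-trimmed-fastq
# mean read length of one sample's forward reads after trimming
gzip -dc Pro-trimmed-fastq/F1-0-5-Pro_0_L001_R1_001.fastq.gz | awk 'NR % 4 == 2 { total += length($0); n++ } END { print n " reads, mean length " total/n }'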
FYI, I discussed the workflows for processing 16S amplicons of differing sizes with @wasade, as I'm comparing the resulting impacts on diversity metrics and composition; it's unclear what the 'best practices' workflow might be in this situation, and it sounds like quite a few people are wrestling with this. @wasade mentioned I could share the comparison with the forum, and if there is interest I am happy to.
Workflows:
1. All data, regardless of primer set, uploaded into one artifact - dada2, using the histograms as a guide for quality trimming - q2-fragment-insertion - resulting filtered table/tree used for downstream q2 analysis.
2. Datasets uploaded separately - cleaned/denoised separately with dada2, based on quality determined by looking at the demux.qzv histograms - denoised data merged - q2-fragment-insertion - resulting filtered table/tree used for downstream q2 analysis.
3. Datasets uploaded separately - primers trimmed separately using cutadapt so the resulting fragments are all 291 bp, per the shortest amplicon set (V4) - denoised with dada2 - denoised data merged - downstream q2 analysis (merge step sketched below).
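For the merge step in workflows 2 and 3, what I have in mind is roughly the following (filenames are placeholders for whatever each dataset's dada2 outputs end up being called):
# rough sketch - one table/rep-seqs pair per primer set, denoised separately with dada2
qiime feature-table merge \
  --i-tables V4-table.qza \
  --i-tables Pro-table.qza \
  --i-tables Arc-table.qza \
  --o-merged-table merged-table.qza
qiime feature-table merge-seqs \
  --i-data V4-rep-seqs.qza \
  --i-data Pro-rep-seqs.qza \
  --i-data Arc-rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza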
I considered just doing a hard trim of the datasets in dada2 rather than cutting at the primers, but after reading some of the QIIME updates from @Nicholas_Bokulich I opted out of that workflow, though I could still run it if it is of interest.
So far, between workflows 1 and 2, it seems to matter whether you upload all the data and process it together versus uploading the datasets separately and merging after denoising, followed by q2-fragment-insertion. This is mostly between the V4 and Pro results: I found differences in alpha and beta diversity measures, and while the taxonomic compositions are similar, the proportions are different.
Example looking at Observed OTUs between workflow 1 (obs-otus-group-significance.qzv (328.2 KB)) and workflow 2 (obs-otus-group-significance.qzv (356.5 KB))
Example looking at Faith-PD between workflow 1 (faith-pd-group-significance.qzv (333.8 KB)) and workflow 2 (faith-pd-group-significance.qzv (362.2 KB))
Perhaps the differences aren't appreciable, but I was seeing them for several diversity measures: one workflow was telling me that for some measures V4 was higher/significant, while in the other workflow those same measures were saying Pro was higher/significant. So the workflow will impact your inferences, and while study design definitely plays a role, it also becomes a question of 'best practices' when dealing with datasets of differing amplicon lengths.
Apologies for the long post, but whichever workflow is determined to be the most accurate (I still have to figure out how we 'decide' that) will go into my publication, so I want to make sure I've accounted for everything and run the plugins correctly - especially since the trimming above produced such low average quality in the resulting demux.qzv. If that's just what it is, OK, but I don't want the result to be because I ran something in error.
Thanks!
Mel