Q2-cutadapt of primer sequences flag verification?

mmelendrez · August 28, 2018, 3:49pm

Greetings qiime2 forum once again! I am in the process of testing workflows for processing data from 16S amplicons of different sizes as a result of amplification with different primer sets (V4, a general prokaryotic set, and a more specific set). All over lap in the region 515F to 805R.

V4 - 515F/806R - yields a 291 bp amplicon
Pro - 341F/805R - yields a 464 bp amplicon
Arc - 515F/915R - yields a 400 bp amplicon

In searching the forum about workflows to process all these datasets - there were three workflows suggested per the following links along with some caveats to keep in mind via @Nicholas_Bokulich:

I've done two workflows with q2-insertion with different preprocessing prior and now I'm working on a workflow that trims of the primers, separate denoising/cleaning with dada2, followed by merge and q2 analysis.

I'd like to verify that I am utilizing the cutadapt plugin correctly - as the flags for the plug in are a bit different than the cutadapt docs. My data is already demultiplexed so I'm basically looking to cut off the ends by primer to get all reads regardless of dataset down to 291 bps (the size of the smallest amplicon overlapping region).

My Pro primers amplify a region from 341F - 805R. My V4 region amplifies starting at 515, not 341, so I basically need to cut off ~174 bps. Per @Nicholas_Bokulich suggestion I wasn't going to do a --p-trim-left-f of 174 because of the differences in amplicon length and instead I was going to trim based on the V4 primer sequence which starts at 515. So of course this assumes that my Pro dataset has the sequence that matches around 515 to my V4 primer. I think this is a fairly safe assumption since it's all 16S and all bacteria - all same samples from the same exact experiment - just sequenced with different primers.

I uploaded all data into a demux.qza fine. I viewed the original Pro-pe-demux.qzv not trimmed with cutadapt (see attached).Pro-pe-demux.qzv (287.4 KB). Typically at this point, if I was processing a single qza, I'd run dada2 after looking at the histograms and trim or trunc based on where quality starts to tank. But instead I'm cutting primers off...

My run of it - Using the V4 primer sequence which starts at 515F (GTGCCAGCMGCCGCGGTAA) to trim off the Pro dataset that starts at 341. And then the normal Pro primer 805R (GACTACNVGGGTATCTAATCC) (FYI - I'm pasting the entire output of the run as I did it in verbose) - My goal again is to reduce this sequence to 291 bps - so I can eventually combine it with the V4 dataset which is already at 291 bps after denoising. I will eventually have to repeat this process for the Arc dataset - but before I launch into that I wanted to make sure what I have now is making sense...

(qiime2-2018.6) wsb255bioimac27:Issue_0033_cutadapt_qiime2 mel_local$ qiime cutadapt trim-paired --p-front-f GTGCCAGCMGCCGCGGTAA --p-front-r GACTACNVGGGTATCTAATCC --i-demultiplexed-sequences Pro-pe-demux.qza --o-trimmed-sequences Pro-pe-demux-primerTrim.qza --verbose

Running external command line application. This may print messages to stdout and/or stderr.

The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: cutadapt --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_0_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_1_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_0_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_1_L001_R2_001.fastq.gz

This is cutadapt 1.16 with Python 3.5.5

Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_0_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-5-Pro_1_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_0_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-5-Pro_1_L001_R2_001.fastq.gz

Running on 1 core

Trimming 2 adapters with at most 10.0% errors in paired-end mode ...

Finished in 4.45 s (42 us/read; 1.44 M reads/minute).

=== Summary ===

Total read pairs processed: 106,663

Read 1 with adapter: 102,365 (96.0%)

Read 2 with adapter: 105,574 (99.0%)

Pairs written (passing filters): 106,663 (100.0%)

Total basepairs processed: 64,211,126 bp

Read 1: 32,105,563 bp

Read 2: 32,105,563 bp

Total written (filtered): 43,295,188 bp (67.4%)

Read 1: 13,407,567 bp

Read 2: 29,887,621 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 102365 times.

No. of allowed errors:

0-9 bp: 0; 10-19 bp: 1

Overview of removed sequences

length count expect max.err error counts

18 33 0.0 1 15 18

19 88 0.0 1 87 1

24 4 0.0 1 2 2

26 15 0.0 1 0 15

33 1 0.0 1 1

59 1 0.0 1 1

68 1 0.0 1 1

70 2 0.0 1 2

79 2 0.0 1 2

86 1 0.0 1 1

91 4 0.0 1 4

93 19 0.0 1 17 2

99 2 0.0 1 2

105 1 0.0 1 1

108 2 0.0 1 2

109 1 0.0 1 1

110 150 0.0 1 137 13

111 3 0.0 1 3

116 4 0.0 1 4

121 4 0.0 1 4

128 9 0.0 1 9

129 4 0.0 1 4

130 2 0.0 1 2

131 6 0.0 1 6

133 1 0.0 1 1

136 1 0.0 1 1

145 1 0.0 1 1

148 15 0.0 1 15

149 3 0.0 1 3

150 26 0.0 1 25 1

151 1313 0.0 1 1287 26

152 6 0.0 1 6

153 1 0.0 1 1

154 1 0.0 1 0 1

159 6 0.0 1 6

160 2 0.0 1 2

165 53 0.0 1 52 1

166 22 0.0 1 19 3

167 168 0.0 1 137 31

168 9188 0.0 1 8636 552

169 3227 0.0 1 3112 115

170 6542 0.0 1 6292 250

171 10107 0.0 1 9703 404

172 402 0.0 1 379 23

173 1203 0.0 1 1146 57

174 37 0.0 1 37

175 229 0.0 1 205 24

176 2337 0.0 1 2260 77

177 139 0.0 1 132 7

178 188 0.0 1 181 7

179 2 0.0 1 2

181 5 0.0 1 5

182 10 0.0 1 10

183 4 0.0 1 4

184 104 0.0 1 74 30

185 1901 0.0 1 1834 67

186 184 0.0 1 160 24

187 7359 0.0 1 7071 288

188 23632 0.0 1 23004 628

189 2585 0.0 1 2509 76

190 921 0.0 1 883 38

191 41 0.0 1 35 6

192 503 0.0 1 433 70

193 25388 0.0 1 24720 668

194 4094 0.0 1 3768 326

195 52 0.0 1 47 5

208 1 0.0 1 1

239 2 0.0 1 2

=== Second read: Adapter 2 ===

Sequence: GACTACNVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 105574 times.

No. of allowed errors:

0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2

Overview of removed sequences

length count expect max.err error counts

3 14 1666.6 0 14

12 1 0.0 1 1

16 1 0.0 1 0 1

17 2 0.0 1 2

18 4 0.0 1 2 2

19 49 0.0 1 3 4 42

20 845 0.0 2 64 665 116

21 104283 0.0 2 101870 1973 440

22 345 0.0 2 62 246 37

23 9 0.0 2 1 0 8

24 2 0.0 2 2

25 2 0.0 2 2

30 1 0.0 2 1

55 1 0.0 2 0 0 1

56 1 0.0 2 1

62 1 0.0 2 0 1

63 1 0.0 2 0 0 1

150 2 0.0 2 2

151 3 0.0 2 1 1 1

154 7 0.0 2 4 1 2

Command: cutadapt --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_2_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_3_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_2_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_3_L001_R2_001.fastq.gz

This is cutadapt 1.16 with Python 3.5.5

Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_2_L001_R1_001.fastq.gz -p /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-a2htise_/F1-0-Pro_3_L001_R2_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA -G GACTACNVGGGTATCTAATCC /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_2_L001_R1_001.fastq.gz /var/folders/12/2j8hq03s52lbnstw5wh008k80000gq/T/qiime2-archive-uqp_1rni/058b8b62-5ee5-482e-b0c9-e46258ad4833/data/F1-0-Pro_3_L001_R2_001.fastq.gz

Running on 1 core

Trimming 2 adapters with at most 10.0% errors in paired-end mode ...

Finished in 4.61 s (41 us/read; 1.45 M reads/minute).

=== Summary ===

Total read pairs processed: 111,659

Read 1 with adapter: 107,578 (96.3%)

Read 2 with adapter: 110,724 (99.2%)

Pairs written (passing filters): 111,659 (100.0%)

Total basepairs processed: 67,218,718 bp

Read 1: 33,609,359 bp

Read 2: 33,609,359 bp

Total written (filtered): 45,266,596 bp (67.3%)

Read 1: 13,984,409 bp

Read 2: 31,282,187 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 107578 times.

No. of allowed errors:

0-9 bp: 0; 10-19 bp: 1

Overview of removed sequences

length count expect max.err error counts

18 49 0.0 1 16 33

19 85 0.0 1 84 1

24 1 0.0 1 0 1

26 12 0.0 1 0 12

59 1 0.0 1 1

70 1 0.0 1 1

71 1 0.0 1 1

75 1 0.0 1 1

79 2 0.0 1 2

83 1 0.0 1 1

86 1 0.0 1 1

91 6 0.0 1 6

92 1 0.0 1 1

93 65 0.0 1 64 1

95 2 0.0 1 0 2

96 1 0.0 1 1

99 1 0.0 1 1

105 1 0.0 1 1

108 1 0.0 1 0 1

109 1 0.0 1 1

110 146 0.0 1 137 9

111 4 0.0 1 4

115 1 0.0 1 1

116 5 0.0 1 4 1

121 8 0.0 1 7 1

125 1 0.0 1 1

128 15 0.0 1 15

129 3 0.0 1 3

130 3 0.0 1 3

131 1 0.0 1 1

133 2 0.0 1 2

134 1 0.0 1 1

135 1 0.0 1 1

139 1 0.0 1 1

141 1 0.0 1 1

144 1 0.0 1 1

145 3 0.0 1 3

147 2 0.0 1 2

148 5 0.0 1 5

149 1 0.0 1 1

150 22 0.0 1 19 3

151 1064 0.0 1 1040 24

152 5 0.0 1 5

153 2 0.0 1 0 2

154 2 0.0 1 0 2

158 1 0.0 1 1

159 10 0.0 1 10

164 1 0.0 1 1

165 63 0.0 1 58 5

166 33 0.0 1 33

167 188 0.0 1 156 32

168 10213 0.0 1 9607 606

169 3417 0.0 1 3307 110

170 7962 0.0 1 7641 321

171 10517 0.0 1 10155 362

172 1051 0.0 1 1003 48

173 1191 0.0 1 1143 48

174 78 0.0 1 74 4

175 352 0.0 1 326 26

176 2073 0.0 1 2015 58

177 106 0.0 1 101 5

178 208 0.0 1 200 8

179 2 0.0 1 2

180 1 0.0 1 1

181 4 0.0 1 4

182 9 0.0 1 9

183 9 0.0 1 7 2

184 90 0.0 1 63 27

185 1705 0.0 1 1632 73

186 168 0.0 1 156 12

187 7418 0.0 1 7085 333

188 23584 0.0 1 22965 619

189 2655 0.0 1 2561 94

190 918 0.0 1 883 35

191 46 0.0 1 46

192 471 0.0 1 391 80

193 27326 0.0 1 26619 707

194 4109 0.0 1 3745 364

195 55 0.0 1 52 3

196 3 0.0 1 3

202 1 0.0 1 1

206 1 0.0 1 1

214 1 0.0 1 1

222 2 0.0 1 2

234 1 0.0 1 1

236 1 0.0 1 1

=== Second read: Adapter 2 ===

Sequence: GACTACNVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 110724 times.

No. of allowed errors:

0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2

Overview of removed sequences

length count expect max.err error counts

3 7 1744.7 0 7

6 1 27.3 0 1

12 1 0.0 1 0 1

16 1 0.0 1 0 1

18 2 0.0 1 0 2

19 41 0.0 1 0 2 39

20 948 0.0 2 61 760 127

21 109345 0.0 2 106849 2017 479

22 339 0.0 2 64 221 54

23 7 0.0 2 1 0 6

24 5 0.0 2 2 1 2

25 1 0.0 2 1

28 1 0.0 2 1

49 1 0.0 2 0 1

60 1 0.0 2 0 0 1

61 3 0.0 2 0 0 3

147 8 0.0 2 6 1 1

150 1 0.0 2 0 0 1

151 2 0.0 2 1 1

153 1 0.0 2 0 1

154 8 0.0 2 5 3 - output continues...

Saved SampleData[PairedEndSequencesWithQuality] to: Pro-pe-demux-primerTrim.qza

(qiime2-2018.6) wsb255bioimac27:Issue_0033_cutadapt_qiime2 mel_local$ qiime demux summarize --i-data Pro-pe-demux-primerTrim.qza --o-visualization Pro-pe-demux-primerTrim.qzv

Saved Visualization to: Pro-pe-demux-primerTrim.qzv

New trim demux attached. Pro-pe-demux-primerTrim.qzv Pro-pe-demux-primerTrim.qzv (292.6 KB)

So then I viewed the results and it's difficult for me to figure out if it worked? From the summary at the beginning of verbose - I think it worked? Then opening the .qzv First you notice the quality on the forward read set sucks - but that may make sense as we did cut off all the of the first 174 bps and if you look at the original Pro demux quality starts to falter-ish around 174 bps in...

Is there a way to check that indeed I am looking at around 291 base-pairs after the trimming versus before aside from exporting the file back to fastq/fasta, opening in VIM and looking at average sequence length - because in the demux qzv's both max out at 301? I supposed I'm confused as to how I verify cutadapt did what it was supposed to such that I now have sequence fragments around 291 bps.

FYI I discussed the workflows for processing 16S amplicons if differing sizes with @wasade as I'm comparing resulting impacts on diversity metrics and composition as it is unclear what the 'best practices' workflow might be in this situaiton and it sounds like quite a few people are wrestling with this. @wasade mentioned I could share the comparison with the forum and if there is interest I am happy to.

Workflows:

All data regardless of primer set uploaded into one artifact - dada2 using histograms as guide for quality trimming - q2-fragment insertion workflow - using resulting filtered table/tree for downstream q2 analysis.
Datasets uploaded separately - cleaned/denoised separately with dada2 based on quality determined by looking at demux.qzv histograms - merge denoised data - q2-fragment-insertion - using resulting filtered table/tree for downstream q2 analysis.
Datasets uploaded separately - primer trimmed separately using cutadapt to resulting fragments are all 291 bps per the shortest amplicon set (V4) - denoise with dada2 - merge denoised data - downstream q2 analysis.

I considered just doing a hard trim of the datasets in dada2 rather than primer cutting - but after reading some of the qiime updates from @Nicholas_Bokulich I opted out of that workflow, though I could still run it if it is of interest.

So far between workflows 1 and 2 it seems to matter if you upload all the data and process together versus uploading separately and merging after denoise followed by q2-fragment-insertion. Mostly between V4 and Pro results - I found differences in alpha and beta diversity measures and while the taxonomy compositions are similar the proportions are different.

Example looking at Observed OTUs between workflow 1 (obs-otus-group-significance.qzv (328.2 KB)) and workflow 2 (obs-otus-group-significance.qzv (356.5 KB))

Example looking at Faith-PD between workflow 1 (faith-pd-group-significance.qzv (333.8 KB)) and workflow 2 (faith-pd-group-significance.qzv (362.2 KB))

Perhaps the differences aren't appreciable - but I was showing them for several diversity measures - where one workflow was telling me for some diversity measures V4 was higher/significant but in the other workflow those same diversity measures were saying Pro was higher/significant - so workflow will impact your inferences, so while study design definitely plays a role, it also becomes a question of 'best practices' when dealing with datasets of differing length amplicons.

Apologies for the long post but whichever workflow is determined to be the most accurate (I still have to figure out how we 'decide' that) I can get will go into my publication so I want to make sure I've accounted for everything and that I've run the plugins correctly, especially since the result of trimming above was so low in average quality in the resulting demux.qzv, if that's what it is, ok, but I don't want the result to be because I ran something in error.

Thanks!
Mel

Mehrbod_Estaki · August 30, 2018, 8:41pm

Hi @mmelendrez,
Woo,this is the longest post I’ve ever worked on but I’m not complaining this actually ended up being quite enjoyable as it forced me to dig deep into this topic that I’ve been neglecting for a while. As so, I imagine my reply is going to be quite long as well. Also thank you very much for your extremely thorough prior researching and explanations. As you probably know by now through reading previous posts this is not a well examined area of the field but it is quite important and is being increasingly recognized as an area which requires much improvements, for example the fragment-insertion method is one awesome tool designed to deal with this issue. That being said, and just so you know, anything I suggest here is based on personal opinion and experience and should be taken with a grain of salt . It would be great to hear other people’s thoughts and experience.
First my own opinion and philosophy regarding trimming to equal lengths. This to me is analogous to the theory behind rarefying wherein you throw away much of your hard earned data in order to be able to perform some task. While some argue this may be a necessary evil, others have argued that doing so actually removes us further away from the true target and we must instead adjust our approach to that task as to not have to discard data. A wave of recent tools with this latter approach in mind are those that make use of relative abundances instead of count data. In the same way, tools like fragment-insertion allow for inclusion of these data types without discarding valuable information. Of course there’s lot of work left to be done here but from a philosophical perspective I think we should always prioritize not throwing away data when possible. So with that in mind my own suggestion is to to take the best/easiest approach which is to keep all your lengths and use fragment-insertion and any questions which cannot be answered based on this approach would be limitations of your study and we should not demand too much more of the data.

BUT, the idea of doing some benchmarking like the one you’ve started is also extremely valuable and will be of great value, especially to curious but busy (read:lazy) people like myself who are unlikely to do these things themselves but still want to know! So let’s go down the rabbit hole.

Having gone through your workflow with cutadapt I think you are good to go and cutadapt is actually performing exactly as it should! We can tell this by a) looking through the output which is very informative and mentions that it gut rid of about 33% of the forward reads which is what we expected. You can also see this clearly in the post-trim qc plots, for example in the forward reads if you were to hover over the 0-130 bp range you’ll see that this is where most of your reads are, and after 130bp there is a massive drop and only about 500-600 reads remain with that length (note the information in the paragraph below the plots for this info). Those few reads correspond to reads that either didn’t have the expected primer or if they did they were different by more than 2 bp or so. It’s hard to know if these are real or artifacts, never the less there only a few of them. Same thing with the reverse reads, you can see that your entire plot has shifted to the left by about 20bp, and about 150bp didn’t get trimmed.
The one thing that could really improve this workflow though would be to somehow tell cutadapt (or some other tool) to discard any reads that do not have your specified primers within them. This probably would be pretty easy to do outside of qiime even prior to cutadapt with grep or something. But luckily these reads tend to have much lower quality scores anyways so they may end up getting filtered out during denoising anyways and we wouldn’t have to worry about them.
Other ways to deal with these would be to run the data against a positive filter of sorts, like lets say 65% greengenes, that way if they are true taxa we retain them otherwise they get tossed.
One more idea I had was to actually to denoise your reads with dada2 before cutadapt. This way dada2 can still build its error model with the full reads and you would pick truncating parameters as usual and then cutadapt. This might even get rid of those few odd reads as either noise or will correct their potential primer mismatches. The issue with this approach being you wouldn’t have representative sequences anymore but perhaps you can hack one with vsearch set at 100% identity clustering or something else.

I’m not surprised to see such differences in your alpha diversity results with these different approaches, as you mentioned workflow matters! To be honest, these days I’m becoming more skeptical of alpha diversity results anyways, but that’s a whole other topic on its own. In your case it would be pretty hard to determine which one is actually closer to the true data unless you have something of a mock community or even simulated community to compare your different methods too. In those cases you can use q2-quality-control to evaluate those workflow.
When using data from these multiple regions I would personally recommend sticking to phylogenetic-based diversity measures as these tend to be more robust to these effects, i.e. use of fragment-insertion again in my mind championing the other methods.
Please do update us with whatever results you get out of these methods, I’m sure it would be useful to developers and other users.
There are a lot more I wanted to say on this topic but I’m very sleep deprived at the moment and so I’ll incubate on those thoughts for the time being. Good luck!

mmelendrez · August 31, 2018, 5:34pm

Thank you @Mehrbod_Estaki for your thoughtful reply!

So yes - I’ve been hedging the q2-fragment-insertion workflows…what was surprising to me was actually I have 2 workflows that use q2-fragment insertion. The first workflow I literally upload all data regardless of what primer was used (that info is in the metadata table). So all of my samples all primer conditions are uploaded as one artifact and treated as ‘one’. So I run dada2 after looking at the demux qza file and determine trim and trunc from there and so on all the way through dada2/q2-fragment insertion analysis - that was workflow 1.

In workflow 2 the only difference is instead of uploading all of the datasets as ‘one’ unit/artifact and running that through dada2…I did each one separately. So I would upload V4 as a dataset, process through dada2; then I’d upload Pro as a dataset, process through dada2; then Arc…process through dada2 so in the end I have 3 sets of results…then I merge everything into one dataset artifact…then use q2-fragment insertion on that.

It’s between these two workflows I find the diversity/compositional inference differences. So the bonus of q2-fragment insertion is you don’t need equal lengths yes, but apparently…how or in what order? you even denoise the data prior to use of q2-fragment-insertion matters - that’s where I was finding the differences.

I would expect differences between a primer-based cutadapt processing follow by analysis versus a q2-fragment-insertion analysis. What I didn’t expect was conflicting results between q2-fragement insertion analyses just because I pre-processed the data through dada2 as one dataset versus three followed by a merge… Both used q2-fragment-insertion. But apparently when you denoise data of differing length fragments through dada2…what you get out for input into q2-fragment insertion is different than if you were to process all datasets separately in dada2 (dada2 is working with data that is all the same fragment amplified and you just do it for the 3 datasets) and then merge them after dada2 denoising.

To q2-fragment-insertion I’m guessing it’s all the same…denoised fragments of differing lengths. But to DADA2 is it all the same? Giving dada2 a dataset of fragments of different lengths to process rather than a dataset of all the same length? Downstream diversity and composition results would suggest…maybe it does matter.

Or maybe this is trivial? When I got conflicting plots downstream…that’s when I took a pause.

Happy to update as I go. Thanks for confirming my cutadapt usage!

And now I’ve written all this I’m thinking I probably should’ve split off into a different post regarding dada2 and denoising fragments of differing length.

Cheers,
Mel

Mehrbod_Estaki · August 31, 2018, 5:44pm

Hi @mmelendrez,

Are these 3 datasets from the same illumina run or from separate runs? You should only run them together if they are from the same run otherwise certainly run them separately. Dada2’s error model building step is based on one run. Since there is a lot of run-to-run variability it would not be appropriate to use the error model between these runs. So it’s not surprising for these different methods to produce different results but the correct one would be based on what I just mentioned above.

mmelendrez · August 31, 2018, 5:48pm

My understanding is yes, they were all run in the same Illumina run, hence running them altogether but I also ran them separately as I saw in many suggested protocols. So I guess I wasn’t expecting such differences in results.

mmelendrez · September 6, 2018, 2:24pm

Ok so I’ve adjusted the workflow for cutadapt processing/merging of files for downstream analysis. Another issue I came up with is…I’m running cutadapt on each dataset separately because I need to cut them all down to a 290 bp common fragment and to do that I have to run different primer sequences with the different datasets. So in the end I have 3 artifacts of trimmed demux’d data - one for Arc, Pro and V4 - all theoretically consisting of demux’d sequence data that is now cut down to size based on primer sites. I know there will be some garbage in there still but dada2 should clean that up when I QC based on histos.

Now all my datasets are from one Illumina MiSeq run - so I know I can process them as one entity in dada2 or I can process separately and merge using qiiime feature-table merge later on.

But I actually need to merge the data prior to running dada2…because my impression is IF I run dada2 on each dataset separately, I will have different QC trim/trunc params for each file most likely. This means after processing with dada2, each file, each run with different trunc/trim params based on QC of the histos I will have resulting files of differing lengths sequences yet again which means running q2-fragement-insertion and the whole point of the workflow was to get everything to one fragment/size, QC together and run QIIME2 without having to run fragment-insertion because again, I’m comparing end results of the different workflows.

SO…I was unable to find a qiime command that combines demux’d artifacts - so I am using q2-cutadapt on each dataset to get everything to the 290 bps based on primer site. Then I am exporting the artifacts using qiime tools export. That gives me back the fastq.gz files and MANIFEST. Then I will cat all MANIFEST’s and import all (now trimmed) fastq.gz datasets back into one qiime2 demux artifact which I can them start dada2 on.

How is the logic on this one?

Cheers - Mel

Mehrbod_Estaki · September 7, 2018, 3:20am

Hi @mmelendrez,

To be honest, I'm not sure how subsetting one MiSeq run into a few datasets and denoising them separately will affect the error model compared to running them all together. This is an interesting off-shoot of this discussion which I will look into and perhaps create a new topic and ping the dada2 creator for input I can't find the answer. But my guess is as long as there is enough reads to create an effective error model it should be ok. Though I think you are perhaps taking the more complicated way here, or maybe I'm missing something still. Why not just denoise your entire dataset first with dada2 using equal qc parameters, then separate them and use cutadapt to trim to equal regions? This way you would avoid all the export/reimporting hassle.

mmelendrez · September 7, 2018, 6:21pm

One Miseq run yes - but all different sized amplicons.

@Mehrbod_Estaki Though I think you are perhaps taking the more complicated way here, or maybe I’m missing something still. Why not just denoise your entire dataset first with dada2 using equal qc parameters, then separate them and use cutadapt to trim to equal regions?

So here is apparently what I must not be understanding then.

Whether I upload and denoise as one bulk dataset or 3 separate ones... Typically don't you look at histos of the demux'd artifact to decide trim/trunc yes? Ok so lets say I do that and run my data through so it's denoised and trim/trunc accordingly - how then are my primer sites available for cutadapt? The probability of them getting trimmed off or truncated off is probably pretty good right? So wouldn't most of my data no longer contain the sequencing information that would then match the primers to then cut everything to size accurately?

Am I not understanding mechanistically something with how dada2 works?

I would've thought you first want to run your primers against the raw data to clean out as much as you can and trim off based on primers because if you do quality trimming/truncating first then you get rid of data that would otherwise help you define your primer sites that need to match to trim everything to the right size.

OR are you suggesting I denoise with dada2 using trim and trunc parameters of zero (you mentioned equal qc parameters)? Get it denoised, then use cutadapt?

Yes, perhaps I'm overthinking this - perhaps I'm not understanding what dada2 is completely doing. Basically I'm worried that no matter what qc trim/trunc parameters I use (except for zero maybe) that I will be trimming/truncating the ends of my fragments. If I trim/truncate my fragments I am potentially removing primer recognition sites that I need in order to trim by primer site right? That is why I did cutadapt first to get everything down to size (hopefully) - then did the export/reorganize/import to run dada2.

Mel

Mehrbod_Estaki · September 7, 2018, 10:21pm

So after some more thinking and digging I realize now that I need to backtrack a bit here. My initial proposal of denoising first then using cutadapt seems to have originated from my fantasies and not reality. I realize you can’t actually use cutadapt on the output of dada2. While I do think this is probably the best approach, it isn’t actually possible in qiime2 , so my apologies for that! Even outside of qiime2, I think this approach would require some cutsom coding.

So going back to your original 3 workflows you tried, I think between option 1 and 2 is still the most appropriate. Which is better is something that is best discussed by the dada2 experts, I posted a similar inquiry here recently so let’s see what they think there. While you showed differences in observed OTUs allocated to different primers between the two workflows, the experimental parameters seem to be the same between the workflows, which is good news! Alpha diversity is rather finicky in my opinion in the first place so I wouldn’t use rely on that as a good comparison marker (a whole other discussion on its own). And as for beta diversity composition that of course would also change to some extend when you consider you may not be comparing the same lengths of the same taxa.

for example: A species of the bacteroides family may be assigned down to the species level when you have a 300bp read for it, but only be assigned to the genus level if you have 280bp of it. This stemming from using different trim/truncating parameters between your 2 workflows. They’re both correct technically, but one is a better estimate of the truth as it has more information.

As for the cutadapt>denoise approach, I don’t know if it would work properly or not to be honest. Personally I would avoid it. I have concerns about comparing V4 and Arc dataset to the Pro one since after trimming the Arc dataset your QC scores will be starting at a position where they have a rather lower score and I just don’t know how dada2 would perform on that. Might be ok, I just don’t know

Sorry couldn’t be more help!

mmelendrez · September 7, 2018, 10:58pm

Okey dokey - no worries. I was trying the cutadapt approach per a discussion with @wasade at ASM where he mentioned you may not actually ‘sacrifice’ that much if you just go with the smaller fragments compositionally speaking - so I thought I’d take a look and see. So ya perhaps you will have differences in resolution but for instance the unclassified bacteria bin may change between workflows. I’m using the same classifier for all workflows. Personally since fragment-insertion was designed for my scenario (and then some) I agree it’s probably the way to go. I won’t know in terms which workflow seems to be giving me the more ‘biologically’ accurate results until I compare all the metrics, which I will be doing this next week. Perhaps it doesn’t matter, and that’s good to know too right? I’ll post when I compile them because it would be lovely to get some forum input on which workflow would be considered least prone to being critique-killed in review.

system · October 9, 2018, 4:58am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.