Cutadapt behaving differently inside and outside of qiime2

jcmcnch · June 18, 2018, 4:25pm

Hi,

I've encountered a weird issue that I can't figure out - hopefully one of you can help!

First off, here is the version info etc:
QIIME2 = qiime2-2018.2
Install method = conda
cutadapt version = 1.15 (as part of same conda env)

Here's what I did:

Imported 37 paired-end fastq data into qiime 2

> 
> qiime tools import \
>   --type 'SampleData[PairedEndSequencesWithQuality]' \
>   --input-path manifest-BACT.csv \
>   --output-path POTATOE-BACT.qza \
>   --source-format PairedEndFastqManifestPhred33

Visualized the quality of forward and reverse reads

> qiime demux summarize \
> --i-data POTATOE-BACT.qza \
> --output-dir POTATOE-BACT_quality \
> --verbose

Ran cutadapt to trim primers (redirecting cutadapt's std output to a file):

> qiime cutadapt trim-paired \
> --i-demultiplexed-sequences POTATOE-BACT.qza \
> --p-front-f CCTACGGGNGGCWGCAG \
> --p-front-r GACTACHVGGGTATCTAATCC \
> --o-trimmed-sequences POTATOE-BACT_primers_trimmed.qza \
> --verbose

Issue encountered: Looking at the stdout from cutadapt, 6 files out of 37 said no primers were detected. I checked manually by opening up the qza artifact file, and sure enough the primers were there in the fastq files. A simple grep search showed that most of the reads had the fwd/rev primer in them.

Troubleshooting steps:

Tried importing two of the samples separately (1 that worked and 1 that didn't). This time, the issue didn't occur... the file that failed before had its primers detected successfully.
Tested same samples (from beginning) with cutadapt standalone. No issue here.
Extracted one of the sequence files that failed with qiime2 and one that succeeded from the qza artefact. Reran with cutadapt. No issues.

Upon closer inspection of the logs, it appears that somehow the CLI arguments are getting messed up such that R1 and R2 are switched in these samples in qiime2 (see below) which would explain why it's not finding primers. I don't understand how this can be the case as the manifest file I used for the initial batch test looks fine to me (pasted at the very bottom) and it works for the majority of samples. So to me, it seems like qiime2 is importing it correctly according to the manifest but somehow the instructions to cutadapt are getting garbled. But I would love this to be a simple mistake on my part but I just can't see where it is... I can provide the sequence files privately but prefer not to share them openly as they are not yet published.

Thanks in advance for your help,

Jesse

Log for failed file (batch import of 37 samples to qiime2 then cutadapt run on all together with qiime2):

> This is cutadapt 1.15 with Python 3.5.5
> Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-r6zhwz2c/stationP5_41_L001_R2_001.fastq.gz -p /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-r6zhwz2c/stationP5_4_L001_R1_001.fastq.gz --front CCTACGGGNGGCWGCAG -G GACTACHVGGGTATCTAATCC /tmp/qiime2-archive-gwcyl56w/08e4e960-f54e-4038-86e5-486610afe00b/data/stationP5_41_L001_R2_001.fastq.gz /tmp/qiime2-archive-gwcyl56w/08e4e960-f54e-4038-86e5-486610afe00b/data/stationP5_4_L001_R1_001.fastq.gz
> Running on 1 core
> Trimming 2 adapters with at most 10.0% errors in paired-end mode ...
> Finished in 18.84 s (111 us/read; 0.54 M reads/minute).
> 
> === Summary ===
> 
> Total read pairs processed:            170,296
>   Read 1 with adapter:                       3 (0.0%)
>   Read 2 with adapter:                       6 (0.0%)
> Pairs written (passing filters):       170,296 (100.0%)
> 
> Total basepairs processed:    85,488,592 bp
>   Read 1:    42,744,296 bp
>   Read 2:    42,744,296 bp
> Total written (filtered):     85,488,496 bp (100.0%)
>   Read 1:    42,744,254 bp
>   Read 2:    42,744,242 bp
> 
> === First read: Adapter 1 ===
> 
> Sequence: CCTACGGGNGGCWGCAG; Type: regular 5'; Length: 17; Trimmed: 3 times.
> 
> No. of allowed errors:
> 0-9 bp: 0; 10-17 bp: 1
> 
> Overview of removed sequences
> length	count	expect	max.err	error counts
> 8	1	2.6	0	1
> 17	2	0.0	1	2
> 
> === Second read: Adapter 2 ===
> 
> Sequence: GACTACHVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 6 times.
> 
> No. of allowed errors:
> 0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2
> 
> Overview of removed sequences
> length	count	expect	max.err	error counts
> 3	4	2660.9	0	4
> 21	2	0.0	2	0 2

Log for successful run where I imported the failed file in a smaller batch (only 2 samples) and it worked inexplicably:

> This is cutadapt 1.15 with Python 3.5.5
> Command line parameters: -a CCTACGGGNGGCWGCAG -A GACTACHVGGGTATCTAATCC -o stationP5outR1.fastq.gz -p stationP5outR2.fastq.gz stationP5_4_L001_R1_001.fastq.gz stationP5_41_L001_R2_001.fastq.gz
> Running on 1 core
> Trimming 2 adapters with at most 10.0% errors in paired-end mode ...
> Finished in 14.72 s (86 us/read; 0.69 M reads/minute).
> 
> === Summary ===
> 
> Total read pairs processed:            170,296
>   Read 1 with adapter:                 169,781 (99.7%)
>   Read 2 with adapter:                 169,584 (99.6%)
> Pairs written (passing filters):       170,296 (100.0%)
> 
> Total basepairs processed:    85,488,592 bp
>   Read 1:    42,744,296 bp
>   Read 2:    42,744,296 bp
> Total written (filtered):        312,231 bp (0.4%)
>   Read 1:       131,759 bp
>   Read 2:       180,472 bp
> 
> === First read: Adapter 1 ===
> 
> Sequence: CCTACGGGNGGCWGCAG; Type: regular 3'; Length: 17; Trimmed: 169781 times.
> 
> No. of allowed errors:
> 0-9 bp: 0; 10-17 bp: 1
> 
> Bases preceding removed adapters:
>   A: 0.0%
>   C: 0.0%
>   G: 0.0%
>   T: 0.0%
>   none/other: 100.0%
> 
> Overview of removed sequences
> length	count	expect	max.err	error counts
> 3	6	2660.9	0	6
> 4	4	665.2	0	4
> 250	18	0.0	1	15 3
> 251	169753	0.0	1	162055 7698
> 
> === Second read: Adapter 2 ===
> 
> Sequence: GACTACHVGGGTATCTAATCC; Type: regular 3'; Length: 21; Trimmed: 169584 times.
> 
> No. of allowed errors:
> 0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2
> 
> Bases preceding removed adapters:
>   A: 0.0%
>   C: 0.0%
>   G: 0.0%
>   T: 0.0%
>   none/other: 100.0%
> 
> Overview of removed sequences
> length	count	expect	max.err	error counts
> 3	4	2660.9	0	4
> 5	1	166.3	0	1
> 9	2	0.6	0	0 2
> 244	1	0.0	2	1
> 248	4	0.0	2	0 0 4
> 250	19	0.0	2	10 9
> 251	169553	0.0	2	164116 5030 407

Log for successful run of same file with standalone cutadapt:

> This is cutadapt 1.15 with Python 3.5.5
> Command line parameters: --discard-untrimmed -g CCTACGGGNGGCWGCAG -G GACTACHVGGGTATCTAATCC -o 3237-WHJ-0005_S5_L001_R1_primers-trimmed.fastq.gz -p 3237-WHJ-0005_S5_L001_R2_primers-trimmed.fastq.gz 3237-WHJ-0005_S5_L001_R1_001.fastq.gz 3237-WHJ-0005_S5_L001_R2_001.fastq.gz
> Running on 1 core
> Trimming 2 adapters with at most 10.0% errors in paired-end mode ...
> Finished in 14.24 s (84 us/read; 0.72 M reads/minute).
> 
> === Summary ===
> 
> Total read pairs processed:            170,296
>   Read 1 with adapter:                 169,895 (99.8%)
>   Read 2 with adapter:                 169,605 (99.6%)
> Pairs written (passing filters):       169,282 (99.4%)
> 
> Total basepairs processed:    85,488,592 bp
>   Read 1:    42,744,296 bp
>   Read 2:    42,744,296 bp
> Total written (filtered):     78,550,377 bp (91.9%)
>   Read 1:    39,613,870 bp
>   Read 2:    38,936,507 bp
> 
> === First read: Adapter 1 ===
> 
> Sequence: CCTACGGGNGGCWGCAG; Type: regular 5'; Length: 17; Trimmed: 169895 times.
> 
> No. of allowed errors:
> 0-9 bp: 0; 10-17 bp: 1
> 
> Overview of removed sequences
> length	count	expect	max.err	error counts
> 3	85	2660.9	0	85
> 11	1	0.0	1	1
> 14	4	0.0	1	2 2
> 15	24	0.0	1	9 15
> 16	943	0.0	1	203 740
> 17	168558	0.0	1	162055 6503
> 18	280	0.0	1	15 265
> 
> === Second read: Adapter 2 ===
> 
> Sequence: GACTACHVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 169605 times.
> 
> No. of allowed errors:
> 0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2
> 
> Overview of removed sequences
> length	count	expect	max.err	error counts
> 15	2	0.0	1	0 2
> 16	3	0.0	1	0 3
> 17	13	0.0	1	10 3
> 18	20	0.0	1	3 17
> 19	103	0.0	1	4 5 94
> 20	1748	0.0	2	86 1604 58
> 21	167289	0.0	2	164116 2972 201
> 22	420	0.0	2	10 379 31
> 23	2	0.0	2	0 0 2
> 25	4	0.0	2	0 0 4
> 28	1	0.0	2	1

My manifest file:

> sample-id,absolute-filepath,direction
> stationP1,$PWD/BACT-341F-805R/3237-WHJ-0001_S1_L001_R1_001.fastq.gz,forward
> stationP2,$PWD/BACT-341F-805R/3237-WHJ-0002_S2_L001_R1_001.fastq.gz,forward
> stationP3,$PWD/BACT-341F-805R/3237-WHJ-0003_S3_L001_R1_001.fastq.gz,forward
> stationP4,$PWD/BACT-341F-805R/3237-WHJ-0004_S4_L001_R1_001.fastq.gz,forward
> stationP5,$PWD/BACT-341F-805R/3237-WHJ-0005_S5_L001_R1_001.fastq.gz,forward
> stationP6,$PWD/BACT-341F-805R/3237-WHJ-0006_S6_L001_R1_001.fastq.gz,forward
> stationP7,$PWD/BACT-341F-805R/3237-WHJ-0007_S7_L001_R1_001.fastq.gz,forward
> stationP8,$PWD/BACT-341F-805R/3237-WHJ-0008_S8_L001_R1_001.fastq.gz,forward
> stationP9,$PWD/BACT-341F-805R/3237-WHJ-0009_S9_L001_R1_001.fastq.gz,forward
> stationP10,$PWD/BACT-341F-805R/3237-WHJ-0010_S10_L001_R1_001.fastq.gz,forward
> stationP11,$PWD/BACT-341F-805R/3237-WHJ-0011_S11_L001_R1_001.fastq.gz,forward
> stationP12,$PWD/BACT-341F-805R/3237-WHJ-0012_S12_L001_R1_001.fastq.gz,forward
> stationP13,$PWD/BACT-341F-805R/3237-WHJ-0013_S13_L001_R1_001.fastq.gz,forward
> stationP14,$PWD/BACT-341F-805R/3237-WHJ-0014_S14_L001_R1_001.fastq.gz,forward
> stationP15,$PWD/BACT-341F-805R/3237-WHJ-0015_S15_L001_R1_001.fastq.gz,forward
> stationP16,$PWD/BACT-341F-805R/3237-WHJ-0016_S16_L001_R1_001.fastq.gz,forward
> stationP17,$PWD/BACT-341F-805R/3237-WHJ-0017_S17_L001_R1_001.fastq.gz,forward
> stationP18,$PWD/BACT-341F-805R/3237-WHJ-0018_S18_L001_R1_001.fastq.gz,forward
> stationP20,$PWD/BACT-341F-805R/3237-WHJ-0019_S19_L001_R1_001.fastq.gz,forward
> stationP21,$PWD/BACT-341F-805R/3237-WHJ-0020_S20_L001_R1_001.fastq.gz,forward
> stationP22,$PWD/BACT-341F-805R/3237-WHJ-0021_S21_L001_R1_001.fastq.gz,forward
> stationP23,$PWD/BACT-341F-805R/3237-WHJ-0022_S22_L001_R1_001.fastq.gz,forward
> stationP24,$PWD/BACT-341F-805R/3237-WHJ-0023_S23_L001_R1_001.fastq.gz,forward
> stationP25,$PWD/BACT-341F-805R/3237-WHJ-0024_S24_L001_R1_001.fastq.gz,forward
> stationP26,$PWD/BACT-341F-805R/3237-WHJ-0025_S25_L001_R1_001.fastq.gz,forward
> stationP27,$PWD/BACT-341F-805R/3237-WHJ-0026_S26_L001_R1_001.fastq.gz,forward
> stationP28,$PWD/BACT-341F-805R/3237-WHJ-0027_S27_L001_R1_001.fastq.gz,forward
> stationP29,$PWD/BACT-341F-805R/3237-WHJ-0028_S28_L001_R1_001.fastq.gz,forward
> stationP30,$PWD/BACT-341F-805R/3237-WHJ-0029_S29_L001_R1_001.fastq.gz,forward
> stationP31,$PWD/BACT-341F-805R/3237-WHJ-0030_S30_L001_R1_001.fastq.gz,forward
> stationM1,$PWD/BACT-341F-805R/3237-WHJ-0031_S31_L001_R1_001.fastq.gz,forward
> stationM2,$PWD/BACT-341F-805R/3237-WHJ-0032_S32_L001_R1_001.fastq.gz,forward
> stationM3,$PWD/BACT-341F-805R/3237-WHJ-0033_S33_L001_R1_001.fastq.gz,forward
> stationM4,$PWD/BACT-341F-805R/3237-WHJ-0034_S34_L001_R1_001.fastq.gz,forward
> stationM5,$PWD/BACT-341F-805R/3237-WHJ-0035_S35_L001_R1_001.fastq.gz,forward
> stationM6,$PWD/BACT-341F-805R/3237-WHJ-0036_S36_L001_R1_001.fastq.gz,forward
> stationM7,$PWD/BACT-341F-805R/3237-WHJ-0037_S37_L001_R1_001.fastq.gz,forward
> stationP1,$PWD/BACT-341F-805R/3237-WHJ-0001_S1_L001_R2_001.fastq.gz,reverse
> stationP2,$PWD/BACT-341F-805R/3237-WHJ-0002_S2_L001_R2_001.fastq.gz,reverse
> stationP3,$PWD/BACT-341F-805R/3237-WHJ-0003_S3_L001_R2_001.fastq.gz,reverse
> stationP4,$PWD/BACT-341F-805R/3237-WHJ-0004_S4_L001_R2_001.fastq.gz,reverse
> stationP5,$PWD/BACT-341F-805R/3237-WHJ-0005_S5_L001_R2_001.fastq.gz,reverse
> stationP6,$PWD/BACT-341F-805R/3237-WHJ-0006_S6_L001_R2_001.fastq.gz,reverse
> stationP7,$PWD/BACT-341F-805R/3237-WHJ-0007_S7_L001_R2_001.fastq.gz,reverse
> stationP8,$PWD/BACT-341F-805R/3237-WHJ-0008_S8_L001_R2_001.fastq.gz,reverse
> stationP9,$PWD/BACT-341F-805R/3237-WHJ-0009_S9_L001_R2_001.fastq.gz,reverse
> stationP10,$PWD/BACT-341F-805R/3237-WHJ-0010_S10_L001_R2_001.fastq.gz,reverse
> stationP11,$PWD/BACT-341F-805R/3237-WHJ-0011_S11_L001_R2_001.fastq.gz,reverse
> stationP12,$PWD/BACT-341F-805R/3237-WHJ-0012_S12_L001_R2_001.fastq.gz,reverse
> stationP13,$PWD/BACT-341F-805R/3237-WHJ-0013_S13_L001_R2_001.fastq.gz,reverse
> stationP14,$PWD/BACT-341F-805R/3237-WHJ-0014_S14_L001_R2_001.fastq.gz,reverse
> stationP15,$PWD/BACT-341F-805R/3237-WHJ-0015_S15_L001_R2_001.fastq.gz,reverse
> stationP16,$PWD/BACT-341F-805R/3237-WHJ-0016_S16_L001_R2_001.fastq.gz,reverse
> stationP17,$PWD/BACT-341F-805R/3237-WHJ-0017_S17_L001_R2_001.fastq.gz,reverse
> stationP18,$PWD/BACT-341F-805R/3237-WHJ-0018_S18_L001_R2_001.fastq.gz,reverse
> stationP20,$PWD/BACT-341F-805R/3237-WHJ-0019_S19_L001_R2_001.fastq.gz,reverse
> stationP21,$PWD/BACT-341F-805R/3237-WHJ-0020_S20_L001_R2_001.fastq.gz,reverse
> stationP22,$PWD/BACT-341F-805R/3237-WHJ-0021_S21_L001_R2_001.fastq.gz,reverse
> stationP23,$PWD/BACT-341F-805R/3237-WHJ-0022_S22_L001_R2_001.fastq.gz,reverse
> stationP24,$PWD/BACT-341F-805R/3237-WHJ-0023_S23_L001_R2_001.fastq.gz,reverse
> stationP25,$PWD/BACT-341F-805R/3237-WHJ-0024_S24_L001_R2_001.fastq.gz,reverse
> stationP26,$PWD/BACT-341F-805R/3237-WHJ-0025_S25_L001_R2_001.fastq.gz,reverse
> stationP27,$PWD/BACT-341F-805R/3237-WHJ-0026_S26_L001_R2_001.fastq.gz,reverse
> stationP28,$PWD/BACT-341F-805R/3237-WHJ-0027_S27_L001_R2_001.fastq.gz,reverse
> stationP29,$PWD/BACT-341F-805R/3237-WHJ-0028_S28_L001_R2_001.fastq.gz,reverse
> stationP30,$PWD/BACT-341F-805R/3237-WHJ-0029_S29_L001_R2_001.fastq.gz,reverse
> stationP31,$PWD/BACT-341F-805R/3237-WHJ-0030_S30_L001_R2_001.fastq.gz,reverse
> stationM1,$PWD/BACT-341F-805R/3237-WHJ-0031_S31_L001_R2_001.fastq.gz,reverse
> stationM2,$PWD/BACT-341F-805R/3237-WHJ-0032_S32_L001_R2_001.fastq.gz,reverse
> stationM3,$PWD/BACT-341F-805R/3237-WHJ-0033_S33_L001_R2_001.fastq.gz,reverse
> stationM4,$PWD/BACT-341F-805R/3237-WHJ-0034_S34_L001_R2_001.fastq.gz,reverse
> stationM5,$PWD/BACT-341F-805R/3237-WHJ-0035_S35_L001_R2_001.fastq.gz,reverse
> stationM6,$PWD/BACT-341F-805R/3237-WHJ-0036_S36_L001_R2_001.fastq.gz,reverse
> stationM7,$PWD/BACT-341F-805R/3237-WHJ-0037_S37_L001_R2_001.fastq.gz,reverse

thermokarst · June 19, 2018, 1:09pm

Hey there @jcmcnch! Fortunately, this looks like a data-related problem, and not an issue specifically with q2-cutadapt vs cutadapt. As you mentioned, it looks like somehow, when running your full dataset through q2-cutadapt, when it gets to a certain point in sample processing, the R1 and R2 files get swapped around, which is why trimming isn't working in q2-cutadapt as you would expect. I suspect that you might not have a matched set of reads for a sample, which would produce an off-by-one error (although validation of the formats should've caught this...).

I wonder if you could send us a link to take a look at your demuxed seqs (you can DM to me if you don't want to share publicly). I want to troubleshoot this a bit more to determine if this is a format problem, or some other processing issue in the boilerplate code around q2-cutadapt. Rest assured though, cutadapt is behaving just fine, standalone or in q2-cutadapt --- it is just getting "the wrong data" in q2-cutadapt, so to speak.

Thanks! :qiime2:

jcmcnch · June 19, 2018, 4:05pm

Hi Matthew,

Thanks very much for looking into this, I'll send you a private message with my demuxed seqs, my manifest and the steps I used to import it.

Jesse

thermokarst · June 19, 2018, 11:18pm

Hey there @jcmcnch! Thanks so much for sending along your data. TLDR - you identified a bug!

Please take a read through there to learn a bit more about the underlying problems.

The next release will be available soon (scheduled for this Thursday) - the fix will be available then.

system · July 21, 2018, 5:18am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.