Cutadapt not working on all sequences

David_Bradshaw · February 28, 2018, 3:12pm

To Whom It May Concern,

I ran the following command, which mostly worked.

qiime cutadapt trim-paired --i-demultiplexed-sequences '/home/microbiology/Deblur/paired-end-demux.qza' --p-cores 8 --p-adapter-f ATTAGAWACCCBNGTAGTCC --p-front-f GTGYCAGCMGCCGCGGTAA --p-adapter-r TTACCGCGGCKGCTGRCAC --p-front-r GGACTACNVGGGTWTCTAAT --output-dir all-trimmed-primers --verbose

My files combine several sequencing runs: most were run at 2x250, but one (5185) was run at 2x300. I am using the EMP modified 515f-806r primers. Some of the sequences in 5185 have the structure 515f + MYSEQUENCE + 806r (reverse complement) + OTHER STUFF, so I ran the command to account for primers at the front of the reads and reverse-complemented primers at the end. Most of my sequences were trimmed correctly, but the 515f primer was not removed from the beginning of almost all of the R1 reads in the 5185 run. (I was unable to attach FASTQ files to show this, even zipped.)
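As a sanity check on the four primer strings in the command above, here is a minimal Python sketch that reverse-complements the two primers while respecting IUPAC degenerate codes. The names COMPLEMENT, reverse_complement, FWD_515F and REV_806R are made up for this example and are not part of QIIME 2 or cutadapt.

```python
# IUPAC nucleotide codes mapped to their complements.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(primer):
    """Return the reverse complement of a primer, degenerate bases included."""
    return "".join(COMPLEMENT[base] for base in reversed(primer.upper()))


# Primer sequences copied from the command in this post.
FWD_515F = "GTGYCAGCMGCCGCGGTAA"
REV_806R = "GGACTACNVGGGTWTCTAAT"

print(reverse_complement(FWD_515F))  # TTACCGCGGCKGCTGRCAC
print(reverse_complement(REV_806R))  # ATTAGAWACCCBNGTAGTCC
```

The two printed strings match the values passed to --p-adapter-r and --p-adapter-f respectively, so the adapter arguments are consistent with the front primers.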

I then tried running a second command to trim the primer again and got this:

qiime cutadapt trim-paired --i-demultiplexed-sequences '/home/microbiology/all-trimmed-primers/trimmed_sequences.qza' --p-cores 8 --p-front-f GTGYCAGCMGCCGCGGTAA --output-dir all-trimmed-primers-v2 --verbose

Running on 8 cores
Trimming 1 adapter with at most 10.0% errors in paired-end legacy mode ...
ERROR: Traceback (most recent call last):
  File "/home/microbiology/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/cutadapt/pipeline.py", line 454, in run
    (n, bp1, bp2) = self._pipeline.process_reads()
  File "/home/microbiology/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/cutadapt/pipeline.py", line 282, in process_reads
    for read1, read2 in self._reader:
  File "/home/microbiology/miniconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/cutadapt/seqio.py", line 414, in iter
    r2 = next(it2)
  File "src/cutadapt/_seqio.pyx", line 234, in iter (src/cutadapt/_seqio.c:5816)
cutadapt.seqio.FormatError: FASTQ file ended prematurely

cutadapt: error: FASTQ file ended prematurely
Plugin error from cutadapt:

Command '['cutadapt', '--cores', '8', '--error-rate', '0.1', '--times', '1', '--overlap', '3', '-o', '/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-uiap5h0f/5185-ME3D17W-515yF-806bR_354_L001_R1_001.fastq.gz', '-p', '/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-uiap5h0f/5185-ME3D17W-515yF-806bR_355_L001_R2_001.fastq.gz', '--front', 'GTGYCAGCMGCCGCGGTAA', '/tmp/qiime2-archive-lce48j8j/ecece2ae-094b-4f3d-b055-731f29ca2153/data/5185-ME3D17W-515yF-806bR_354_L001_R1_001.fastq.gz', '/tmp/qiime2-archive-lce48j8j/ecece2ae-094b-4f3d-b055-731f29ca2153/data/5185-ME3D17W-515yF-806bR_355_L001_R2_001.fastq.gz']' returned non-zero exit status 1

Thank you for your time and help,

Sincerely,

David Bradshaw

thermokarst · February 28, 2018, 3:27pm

This file appears to be corrupted. Do you still have the source file lying around? If it is a fastq.gz file, you can test the archive's validity like this:

gzip -tv /path/to/5185-ME3D17W-515yF-806bR_355_L001_R2_001.fastq.gz

If the file is valid (that command will return 'OK'), then you could run:

qiime tools validate /path/to/demuxed-seqs.qza --level max

This will validate the data as imported. Files are sometimes corrupted when they are moved around, so I am hoping that one of these two checks will give us a hint as to when and where it happened.
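For anyone who prefers to run a similar check from Python, here is a minimal sketch. It is not part of QIIME 2 or cutadapt, and the function name check_fastq_gz is made up for illustration. It reads the file to the end, which surfaces a damaged gzip stream in much the same way gzip -t does, and it also checks that the line count is a multiple of four, since a FASTQ record that stops partway through is one plausible way to end up with a "FASTQ file ended prematurely" error.

```python
import gzip
import zlib


def check_fastq_gz(path):
    """Return (complete_records, problem); problem is None if nothing was found."""
    n_lines = 0
    try:
        # Binary mode avoids decoding errors on damaged data.
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                n_lines += 1
    except (OSError, EOFError, zlib.error) as exc:
        # Raised for a bad header, corrupt data, a failed CRC, or a truncated stream.
        return n_lines // 4, "gzip stream is damaged: {}".format(exc)
    if n_lines % 4 != 0:
        # A FASTQ record is four lines, so a remainder means the last one is cut off.
        return n_lines // 4, "last record is incomplete ({} lines)".format(n_lines)
    return n_lines // 4, None


# Example call (replace the placeholder with a real location on your system):
# print(check_fastq_gz("/path/to/5185-ME3D17W-515yF-806bR_355_L001_R2_001.fastq.gz"))
```

Running it on both the R1 and R2 files of a pair and comparing the record counts would also show whether the two files have fallen out of step.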

Once we get this sorted out we can figure out what is going on with the trimming issue you reported initially.

Thanks!

David_Bradshaw · February 28, 2018, 4:06pm

Dear Matthew Ryan Dillon,

Thank you for the help. I ran it and got the following:

microbiology@willow:~$ gzip -tv /path/to/5185-ME3D17W-515yF-806bR_355_L001_R2_001.fastq.gz
gzip: /path/to/5185-ME3D17W-515yF-806bR_355_L001_R2_001.fastq.gz: No such file or directory

This is strange, since I can open the original file.

I am also running the following command, which seems to work better but still leaves some primers in place when I check random samples for the primer sequences. I guess I cannot expect the program to find all of them? I am still curious why the original command did not work for the 5185 reads.

(qiime2-2018.2) microbiology@willow:~$ qiime cutadapt trim-paired --i-demultiplexed-sequences '/home/microbiology/Deblur/paired-end-demux.qza' --p-cores 8 --p-adapter-f GTGYCAGCMGCCGCGGTAA...ATTAGAWACCCBNGTAGTCC --p-adapter-r GGACTACNVGGGTWTCTAAT...TTACCGCGGCKGCTGRCAC --output-dir all-trimmed-primers3 --verbose

=== Summary ===

Total read pairs processed: 15,169
Read 1 with adapter: 14,934 (98.5%)
Read 2 with adapter: 14,835 (97.8%)
Pairs written (passing filters): 15,169 (100.0%)

Total basepairs processed: 7,602,429 bp
Read 1: 3,802,722 bp
Read 2: 3,799,707 bp
Total written (filtered): 7,022,944 bp (92.4%)
Read 1: 3,519,915 bp
Read 2: 3,503,029 bp

=== First read: Adapter 2 ===

Sequence: GTGYCAGCMGCCGCGGTAA...ATTAGAWACCCBNGTAGTCC; Type: linked; Length: 19+20; 5' trimmed: 14934 times; 3' trimmed: 0 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Overview of removed sequences at 5' end
length  count  expect  max.err  error counts
18      957    0.0     1        0 957
19      13959  0.0     1        12476 1483
20      18     0.0     1        0 18

Overview of removed sequences at 3' end
length count expect max.err error counts

=== Second read: Adapter 5 ===

Sequence: GGACTACNVGGGTWTCTAAT...TTACCGCGGCKGCTGRCAC; Type: linked; Length: 20+19; 5' trimmed: 14835 times; 3' trimmed: 74 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1

Overview of removed sequences at 5' end
length  count  expect  max.err  error counts
18      25     0.0     1        0 0 25
19      220    0.0     1        0 211 9
20      14573  0.0     2        14114 371 88
21      17     0.0     2        0 16 1

Overview of removed sequences at 3' end
length  count  expect  max.err  error counts
3       65     237.0   0        65
4       9      59.3    0        9

Saved SampleData[PairedEndSequencesWithQuality] to: all-trimmed-primers3/trimmed_sequences.qza
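A note on the "No. of allowed errors" lines in the report: they appear to follow from the 10% error rate (--error-rate 0.1) shown in the earlier cutadapt call in this thread. As I understand the cutadapt documentation, the limit is the matched length multiplied by the maximum error rate, rounded down. A small sketch of that arithmetic, with allowed_errors as a made-up helper name:

```python
import math


def allowed_errors(match_length, error_rate=0.1):
    """Matched length times the error rate, rounded down (assumed cutadapt rule)."""
    return math.floor(match_length * error_rate)


for length in (9, 10, 19, 20):
    print(length, allowed_errors(length))
# 9 0
# 10 1
# 19 1
# 20 2
```

This reproduces the ranges in the report: 0 errors up to 9 bp, 1 error for 10 to 19 bp, and 2 errors at 20 bp, which is why the 19 bp forward primer tolerates one mismatch and the 20 bp reverse primer tolerates two.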

Visualizations from before and after trimming are attached:
trimmed_seqs3.qzv (296.2 KB)
paired-end-demux.qzv (289.0 KB)
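The spot check for leftover primers mentioned above can be sketched in Python roughly as follows. This is only an illustration: the IUPAC table, primer_regex and count_leading_primer are names invented here, the three reads are made-up examples rather than real data, and the match is exact, so unlike cutadapt it tolerates no mismatches or indels.

```python
import re

# IUPAC degenerate codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a pattern that matches any concrete instance of the primer."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def count_leading_primer(reads, primer):
    """Count reads that begin with the primer (exact match, no errors allowed)."""
    pattern = primer_regex(primer)
    return sum(1 for read in reads if pattern.match(read))


# Made-up reads for illustration only.
reads = [
    "GTGCCAGCAGCCGCGGTAATACGGAGG",  # starts with 515f (Y=C, M=A)
    "GTGTCAGCCGCCGCGGTAATACGTAGG",  # starts with 515f (Y=T, M=C)
    "TACGGAGGGTGCAAGCGTTAATCGGAA",  # primer already removed
]
print(count_leading_primer(reads, "GTGYCAGCMGCCGCGGTAA"))  # 2
```

Because the check is exact, reads whose primer contains a sequencing error will be missed by it, so it gives a lower bound on how many primers remain.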

Sincerely,

David Bradshaw

thermokarst · February 28, 2018, 6:43pm

Oops! I should have made this clearer: /path/to/.... needs to be replaced with wherever you have those files stashed (it is pretty unlikely that you have directories on your system called path, to, etc.). Thanks!

David_Bradshaw · March 1, 2018, 4:35pm

Dear Matthew Dillon,

microbiology@willow:~$ gzip -tv all-trimmed-primers/trimmed_sequences.qza/ecece2ae-094b-4f3d-b055-731f29ca2153/data/5185-ME3D17W-515F-806R_355_L001_R2_001.fastqgz

gzip: all-trimmed-primers/trimmed_sequences.qza/ecece2ae-094b-4f3d-b055-731f29ca2153/data/5185-ME3D17W-515F-806R_355_L001_R2_001.fastqgz: Not a directory

(qiime2-2018.2) microbiology@willow:~$ qiime tools validate '/home/microbiology/Deblur/paired-end-demux.qza' --level max
Artifact /home/microbiology/Deblur/paired-end-demux.qza appears to be valid at level=max.

I ran the commands and got the output above.

Honestly, I have gone ahead and used TrimGalore to trim my sequences, and I have imported the results into QIIME 2.

I removed the first 19 bases of the R1 reads and the first 20 bases of the R2 reads, and trimmed 50 bases off the ends of the 5185 reads to make them more like the other runs. The import went well, I lost about 20% of the reads when joining pairs, and I am working through the rest of the workflow.
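For readers following along, the fixed-length trimming described here amounts to slicing each read and its quality string by the same offsets. A minimal sketch, which is not TrimGalore's implementation; hard_trim and its parameters are names invented for this example:

```python
def hard_trim(seq, qual, head=0, tail=0):
    """Drop `head` bases from the 5' end and `tail` bases from the 3' end."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) - tail
    # Slice both strings identically so each base keeps its quality score.
    return seq[head:end], qual[head:end]


# Toy example: remove 2 bases from the start and 3 from the end.
print(hard_trim("ACGTACGTAC", "IIIIIIIIII", head=2, tail=3))
# ('GTACG', 'IIIII')
```

With the numbers from this post, R1 reads would use head=19, R2 reads head=20, and the 5185 reads would additionally use tail=50. Note that this removes a fixed number of bases whether or not a primer is actually present.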

Does that sound like a sufficient strategy?

Thank you for your time and help,

Sincerely,

David Bradshaw

thermokarst · March 1, 2018, 10:03pm

Hey @David_Bradshaw, it looks like you basically did the same thing over again: you ran that test command against a file that doesn't exist at that path (looking at it, I suspect a typo, since the . between fastq and gz is missing, but that is just a guess).

Yep! Whatever gets the job done!

Thanks!
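A side note on the "Not a directory" message earlier in the thread: a .qza artifact is a zip archive rather than a directory, so a path that continues inside it will not resolve until the archive has been extracted (for example with unzip or qiime tools export). As a rough sketch, assuming Python is available, the gzipped FASTQ members can also be tested without extracting anything. check_qza_fastqs is a hypothetical helper, not a QIIME 2 command.

```python
import gzip
import zipfile
import zlib


def check_qza_fastqs(qza_path):
    """Map each .fastq.gz member of a .qza archive to 'OK' or a failure message."""
    results = {}
    with zipfile.ZipFile(qza_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".fastq.gz"):
                continue
            try:
                # Decompress the member fully; damaged streams raise on read.
                with archive.open(name) as member, gzip.open(member, "rb") as fh:
                    while fh.read(1 << 20):
                        pass
                results[name] = "OK"
            except (OSError, EOFError, zlib.error) as exc:
                results[name] = "FAILED: {}".format(exc)
    return results


# Example call, using the artifact path from earlier in the thread:
# for name, status in check_qza_fastqs("all-trimmed-primers/trimmed_sequences.qza").items():
#     print(status, name)
```

Any member reported as FAILED would be the place to start looking for where the corruption was introduced.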
