Cutadapt error using trnL DNA sequences

Steven_Mamet · January 10, 2019, 10:53pm

Hello,

I've been attempting to trim primers and adapters from my fastq files without success. I've been using the following command on our server:

parallel --link --jobs 50 'cutadapt
--pair-filter any
--no-indels
--discard-untrimmed
-a TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
-A GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
-g CGAAATYGGTAGACGCTACG
-G CCDTYGAGTCTCTGCACCTATC
-o primer_trimmed_fastqs/{1/}
-p primer_trimmed_fastqs/{2/}
{1} {2}
> primer_trimmed_fastqs/{1/}_cutadapt_log.txt' ::: raw_reads/*_R1.fastq.gz ::: raw_reads/*_R2.fastq.gz

This has worked previously on other trnL sequences, but I consistently get the same error messages:

cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.
cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.

...and so on

I'm not sure why I'm getting this error. I came across this post among some similar ones, which mentions header issues or read count discrepancies:

https://github.com/marcelm/cutadapt/issues/197

However, when inspecting my fastq files, I see the whitespace doesn't seem to be an issue. Here is an example R1:

@M01666:85:000000000-BNT9H:1:1101:15781:1556 1:N:0:199
CGAAATCGGTAGACGCTACGGACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGATAACTTTCAAATTCAGAGAAACCCTGGAATTAACAATGGGCAATCCTGAGCCAAATCCTGGGTTACGCGAACAAACCGGAGTTTAGAA
+
?ABBBBFCCBCCGGGGGGGGGFEGHGFHHHHHHHHHHHHHGHHHGHGHGHGHHHHHHHGHGHHHHHHHHHGFHHGHHGHHHFFHHHGHHHGHHHHHHGGHHGGHHEHHHHHHHHHHHHGHHHHHEGGHHGGGGGGGHHHGGGGGFHHFGGH
@M01666:85:000000000-BNT9H:1:1101:13846:1644 1:N:0:199
CGAAATTGGTAGACGCTACGGACTTAATTGGATTGGGCCTTGGTATGGAAACCTGCTGAGTGAGAACTTTCAAATTCAGAGAAACCCTGGAATTAATAAAAAGGGGCAATCCTGAGCCAAATCCTATTTTTCGAAAACAAAGGTTTAGAAA

And its R2 complement:

@M01666:85:000000000-BNT9H:1:1101:15781:1556 2:N:0:199
CCATTGAGTCTCTGCACCTATCCCTTTTTTTCTCGCTTTCTAAACTCCGGTTTGTTCGCGTAACCCAGGATTTGGCTCAGGATTGCCCATTGTTAATTCCAGGGTTTCTCTGAATTTGAAAGTTATCACTTAGTAGGTTTCCATACCAAGG
+
BBBBBFFFFFFFGGGGGGGGGGHHHHHHHGGHHHGGGHGHHHFHHHHHGGGGGGFHHGGGGGGGHHGHGHHHHHHHHHHHHGHHHHHHGHHHHHGFHHHHHHHCGHEHHHHGGGHHHHHHHGEDHHHHHHHHHHHHGHHHHHHHHHHHHFF
@M01666:85:000000000-BNT9H:1:1101:13846:1644 2:N:0:199
CCTTTGAGTCTCTGCACCTATCCCCTTTTTCACTTTCTAAACCTTTGTTTTCGAAAAATAGGATTTGGCTCAGGATTGCCCCTTTTTATTAATTCCAGGGTTTCTCTGAATTTGAAAGTTCTCACTCAGCAGGTTTCCATACCAAGGCCCA
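
Eyeballing the first two records only goes so far. A rough way to compare the read IDs across a whole pair of files might look like the sketch below. The function names `read_ids` and `ids_match` are made up for illustration. It assumes standard four-line FASTQ records, gzipped files, and headers like the ones above, where the read ID is the first whitespace-delimited field (older `/1` and `/2` style headers would need extra handling).

```shell
# Print the read ID (first whitespace-delimited field) of every header line.
# Assumes four-line FASTQ records, so headers are lines 1, 5, 9, ...
read_ids() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { print $1 }'
}

# Return 0 if both files list the same read IDs in the same order, 1 otherwise.
ids_match() {
    local a b rc
    a=$(mktemp)
    b=$(mktemp)
    read_ids "$1" > "$a"
    read_ids "$2" > "$b"
    if cmp -s "$a" "$b"; then
        rc=0
    else
        rc=1
    fi
    rm -f "$a" "$b"
    return "$rc"
}
```

Calling `ids_match sample_R1.fastq.gz sample_R2.fastq.gz` (placeholder file names) returns a non-zero status if the IDs differ or one file runs out of reads early.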

Does anyone have any advice here? I'm baffled that this has worked on other sequences and now I can't get it to work to save my life. I'm guessing there's a user error somewhere, but I simply can't find it.

Thanks,

Steve

thermokarst · January 10, 2019, 11:30pm

Hey there @Steven_Mamet!

I moved this over to the Other Bioinformatics Tools channel on our forum, since we aren't involved in the development of cutadapt at all. Someone here might be able to answer. I would recommend submitting this issue to the official cutadapt support venue though (I think that is their GH issue tracker). Thanks!

Steven_Mamet · January 11, 2019, 12:06am

Sorry about that @thermokarst—it was the end of the day and I posted it in the wrong forum. Thanks for moving it over!

Micro_Biologist · January 12, 2019, 11:40am

Just going by your error

cutadapt: error: Reads are improperly paired. There are more reads in file 2 than in file 1.

It looks like one file has more reads, maybe just run:

wc -l file_name

for your fastq files (decompress them first, e.g. with zcat, since yours are gzipped) and make sure that the F and R read files for each sample/index have the same number of lines?
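
With many files, the same check could be scripted. The following is only a sketch: `count_reads`, `check_pair`, and `check_dir` are hypothetical helper names, and it assumes gzipped four-line FASTQ files named with the `_R1.fastq.gz` / `_R2.fastq.gz` suffixes used in the command above.

```shell
# Count FASTQ records in a gzipped file, assuming four lines per record.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Compare read counts for one R1/R2 pair; report and return 1 on a mismatch.
check_pair() {
    local r1="$1" r2="$2" n1 n2
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    if [ "$n1" -ne "$n2" ]; then
        echo "MISMATCH $r1 ($n1 reads) vs $r2 ($n2 reads)"
        return 1
    fi
    return 0
}

# Check every *_R1.fastq.gz in a directory against its *_R2.fastq.gz mate.
check_dir() {
    local dir="$1" r1 r2 status=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        check_pair "$r1" "$r2" || status=1
    done
    return "$status"
}
```

Running `check_dir raw_reads` would then print one line per pair whose counts differ. A missing R2 file shows up as a mismatch with 0 reads.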

Steven_Mamet · January 14, 2019, 3:48pm

Thanks @Micro_Biologist.

Here's an update: it turns out I had actually been using cutadapt 1.14. With that version, only 30 pairs of the ~1100 fastq files I had would pass cutadapt. I checked several of the files that failed (including the two I mentioned earlier) and found they were incorrectly labeled as "improperly paired".

After updating to 1.18, 151 pairs now passed cutadapt. Inspecting the pairs that didn't pass this round showed those files did indeed have differing numbers of reads. So in a nutshell: I messed up by using an old version of cutadapt, but in the end found out some of the pairing errors were true, so perhaps there was some file corruption when those sequences came off the sequencer.
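
One partial check for the corruption theory is whether the compressed files are intact, which `gzip -t` tests. The sketch below uses hypothetical helper names (`is_intact_gzip`, `list_corrupt`) and assumes the files end in `.fastq.gz`. A file that passes is only known to be a complete gzip stream. It could still be missing reads if it was truncated before compression.

```shell
# Return 0 if the gzip stream decompresses cleanly, non-zero otherwise.
is_intact_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Print the path of every *.fastq.gz in a directory that fails the test.
list_corrupt() {
    local dir="$1" f
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue
        is_intact_gzip "$f" || echo "$f"
    done
}
```

For example, `list_corrupt raw_reads` would list any truncated or damaged archives in that directory.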

Micro_Biologist · January 16, 2019, 7:38am

Glad to hear it's all sorted! Hopefully you can work out which files go with which (and which are actually what sample!)