Demultiplexing with cutadapt: "Reads are improperly paired" error after reorienting reads

marioncdl · November 13, 2019, 12:33pm

Hi everyone,
I am having trouble demultiplexing my files with cutadapt.

My data come from Illumina metabarcoding sequencing and look like this:
Paired-end reads: 2 files, R1 and R2
In each file we can find:
ADAPTER_BARCODE-FORWARD1_DNAsequence_BARCODE-REVERSE_ADAPTER
ADAPTER_BARCODE-REVERSE_DNAsequence_BARCODE-FORWARD1_ADAPTER1
...
I have several adapters and barcodes in each file, and my sequences are in several orientations, so it's a bit tricky.

The first step I did was to reorient my sequences so that they all have the same direction:
ADAPTER_BARCODE-FORWARD1_DNAsequence_BARCODE-REVERSE_ADAPTER

Script for the case where I have 3 forward/reverse barcodes:
AMORCE... = the sequence of my adapter plus the barcode


AMORCEF1=$(awk '($14 == "1") && ($15 == "F") {print $16}' $NAMESEQTXT)
AMORCER1=$(awk '($14== "1") && ($15 == "R") {print $16}' $NAMESEQTXT)
...
FICHIER_R1=$(find -name "*R1*")
FICHIER_R2=$(find -name "*R2*")
FUNTRIM_R1_STEP1="Banque${i}_R1_untrimmed_step1.fastq.gz"
FUNTRIM_R2_STEP1="Banque${i}_R2_untrimmed_step1.fastq.gz"
...
# 1st primer
echo "Step 1/6"
echo $FICHIER_R1 $FICHIER_R2 >>$SUMMARY 2>&1
cutadapt -g $AMORCER1 -G $AMORCEF1 $FICHIER_R1 $FICHIER_R2 \
	--untrimmed-output $FUNTRIM_R1_STEP1 \
	--untrimmed-paired-output $FUNTRIM_R2_STEP1 \
	--action=none -o $FTRIM_R1_STEP1 -p $FTRIM_R2_STEP1 \
	>>$SUMMARY 2>&1

echo "Step 2/6"
echo $FUNTRIM_R1_STEP1 $FUNTRIM_R2_STEP1 >>$SUMMARY 2>&1
cutadapt -g $AMORCEF1 -G $AMORCER1 $FUNTRIM_R1_STEP1 $FUNTRIM_R2_STEP1 \
	--untrimmed-output $FUNTRIM_R1_STEP2 \
	--untrimmed-paired-output $FUNTRIM_R2_STEP2 \
	--action=none -o $FTRIM_R1_STEP2 -p $FTRIM_R2_STEP2 \
	>>$SUMMARY 2>&1

# 2nd primer
echo "Step 3/6"
echo $FUNTRIM_R1_STEP2 $FUNTRIM_R2_STEP2 >>$SUMMARY 2>&1
cutadapt -g $AMORCER2 -G $AMORCEF2 $FUNTRIM_R1_STEP2 $FUNTRIM_R2_STEP2 \
	--untrimmed-output $FUNTRIM_R1_STEP3 \
	--untrimmed-paired-output $FUNTRIM_R2_STEP3 \
	--action=none -o $FTRIM_R1_STEP3 -p $FTRIM_R2_STEP3 \
	>>$SUMMARY 2>&1
...
gunzip $FTRIM_R1_STEP2 ; gunzip $FTRIM_R2_STEP2
UNZIPR1="Banque${i}_R1_trimmed_step2.fastq"
UNZIPR2="Banque${i}_R2_trimmed_step2.fastq"
R1RECCOMP="Banque${i}_R1_trimmed_step2_rev_comp.fastq.gz"
R2RECCOMP="Banque${i}_R2_trimmed_step2_rev_comp.fastq.gz"

gunzip $FTRIM_R1_STEP4 ; gunzip $FTRIM_R2_STEP4
UNZIPR14="Banque${i}_R1_trimmed_step4.fastq"
UNZIPR24="Banque${i}_R2_trimmed_step4.fastq"
R1RECCOMP4="Banque${i}_R1_trimmed_step4_rev_comp.fastq.gz"
R2RECCOMP4="Banque${i}_R2_trimmed_step4_rev_comp.fastq.gz"

gunzip $FTRIM_R1_STEP6 ; gunzip $FTRIM_R2_STEP6
UNZIPR16="Banque${i}_R1_trimmed_step6.fastq"
UNZIPR26="Banque${i}_R2_trimmed_step6.fastq"
R1RECCOMP6="Banque${i}_R1_trimmed_step6_rev_comp.fastq.gz"
R2RECCOMP6="Banque${i}_R2_trimmed_step6_rev_comp.fastq.gz"

fastx_reverse_complement -z -i $UNZIPR1 -o $R1RECCOMP >>$SUMMARY 2>&1
fastx_reverse_complement -z -i $UNZIPR2 -o $R2RECCOMP >>$SUMMARY 2>&1
fastx_reverse_complement -z -i $UNZIPR14 -o $R1RECCOMP4 >>$SUMMARY 2>&1
fastx_reverse_complement -z -i $UNZIPR24 -o $R2RECCOMP4 >>$SUMMARY 2>&1
fastx_reverse_complement -z -i $UNZIPR16 -o $R1RECCOMP6 >>$SUMMARY 2>&1
fastx_reverse_complement -z -i $UNZIPR26 -o $R2RECCOMP6 >>$SUMMARY 2>&1
RetourCode=${?}
if (( $RetourCode == 0 ))
then
cat $FTRIM_R1_STEP1 >forward.fastq.gz ;cat $R1RECCOMP >>forward.fastq.gz
cat $FTRIM_R1_STEP3 >>forward.fastq.gz;cat $R1RECCOMP4 >>forward.fastq.gz
cat $FTRIM_R1_STEP5 >>forward.fastq.gz;cat $R1RECCOMP6 >>forward.fastq.gz
cat $FTRIM_R2_STEP1 >reverse.fastq.gz ;cat $R2RECCOMP >>reverse.fastq.gz
cat $FTRIM_R2_STEP3 >>forward.fastq.gz;cat $R2RECCOMP4 >>forward.fastq.gz
cat $FTRIM_R2_STEP5 >>forward.fastq.gz;cat $R2RECCOMP6 >>forward.fastq.gz
mv forward.fastq.gz $NAMESEQ ; mv reverse.fastq.gz $NAMESEQ

This step runs fine, with no errors.

Then I import my data:

(i=40)
printf '\n========== Banque %s ============' "$i" >>$SUMMARY2 2>&1
printf '\n IMPORT \n' >>$SUMMARY2 2>&1
qiime tools import \
	--type MultiplexedPairedEndBarcodeInSequence \
	--input-path $NAMESEQ \
	--output-path ${NAMESEQ}.qza \
	>>$SUMMARY2 2>&1

RESULT : Imported B40seq as MultiplexedPairedEndBarcodeInSequenceDirFmt to B40seq.qza

The problem appears when I try to demultiplex:

printf '\n DEMULTIPLEXING \n' >>$SUMMARY2 2>&1
qiime cutadapt demux-paired \
	--i-seqs ${NAMESEQ}.qza \
	--m-forward-barcodes-file $NAMESEQTXT \
	--m-forward-barcodes-column Demul_seq \
	--p-error-rate 0 \
	--o-per-sample-sequences demultiplexed-seqs-${NAME}.qza \
	--o-untrimmed-sequences unknown_tag_${NAME}.qza \
	--verbose \
	>>$SUMMARY2 2>&1

Command: cutadapt --front file:/tmp/tmpgyaizcep --error-rate 0.0 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-xauw14d7/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-780z1m65/forward.fastq.gz -p /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-xauw14d7/{name}.2.fastq.gz --untrimmed-paired-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-780z1m65/reverse.fastq.gz /tmp/qiime2-archive-nd_84ip6/71c7dec0-9f63-48fd-a815-835290a57175/data/forward.fastq.gz /tmp/qiime2-archive-nd_84ip6/71c7dec0-9f63-48fd-a815-835290a57175/data/reverse.fastq.gz

This is cutadapt 1.18 with Python 3.6.7
Command line parameters: --front file:/tmp/tmpgyaizcep --error-rate 0.0 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-xauw14d7/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-780z1m65/forward.fastq.gz -p /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-xauw14d7/{name}.2.fastq.gz --untrimmed-paired-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-780z1m65/reverse.fastq.gz /tmp/qiime2-archive-nd_84ip6/71c7dec0-9f63-48fd-a815-835290a57175/data/forward.fastq.gz /tmp/qiime2-archive-nd_84ip6/71c7dec0-9f63-48fd-a815-835290a57175/data/reverse.fastq.gz
Processing reads on 1 core in paired-end legacy mode ...
**cutadapt: error: Reads are improperly paired. There are more reads in file 1 than in file 2.**

I understand that I don't have the same number of sequences in my R1 and R2 files, but I don't know how to fix it...
(I have looked in the forward and reverse FASTQ files, and the sequence names still correspond...)

Do you have any idea how I can handle this?

PS: When I demultiplex my files without reorienting my data, it works. It also works when I reorient my data but there is just one forward and one reverse barcode...
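
Since the cutadapt error is about read counts, one way to confirm the mismatch before importing is to count the records in each file. The following is a minimal sketch, assuming standard 4-line FASTQ records and gzip-compressed files; `count_reads` and `check_pairing` are hypothetical helper names, not part of cutadapt or QIIME 2.

```shell
#!/usr/bin/env bash

# Count the reads in a gzipped FASTQ file.
# Assumption: every record is exactly 4 lines (no wrapped sequences).
count_reads() {
    local n
    n=$(gzip -cd -- "$1" | wc -l)
    echo $(( n / 4 ))
}

# Print both counts and succeed only if they are equal.
check_pairing() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1: $n1 reads, R2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

Running `check_pairing forward.fastq.gz reverse.fastq.gz` on the two concatenated files should show whether they diverge. Equal counts are necessary for proper pairing but not sufficient, since the reads also have to appear in the same order in both files.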

thermokarst · November 18, 2019, 6:08pm

Hi @marioncdl, I am reclassifying this to "other bioinformatics tools", because it looks like the script you provided above is generating invalid FASTQ file pairs.

Once you get your data in order, we can help you with any issues you might be having in QIIME 2, but for now, this error appears to be caused by the script generating mismatched file pairs.
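
One possible source of the mismatch is visible in the concatenation step of the original script: the last two `cat` lines for the R2 files (steps 3 to 6) append to forward.fastq.gz rather than reverse.fastq.gz, which would leave more reads in the forward file than in the reverse file. A minimal sketch of that step follows, assuming this is a typo and that each R2 part belongs in reverse.fastq.gz in the same order as its R1 mate; `concat_parts` is a hypothetical helper introduced only for illustration.

```shell
#!/usr/bin/env bash

# Concatenate gzipped parts into a single output file.
# Concatenated gzip members still form a valid gzip stream,
# so the parts do not need to be decompressed first.
concat_parts() {
    local out=$1
    shift
    cat -- "$@" > "$out"
}

# Intended use with the variables from the script above (not executed here):
#   concat_parts forward.fastq.gz "$FTRIM_R1_STEP1" "$R1RECCOMP" \
#       "$FTRIM_R1_STEP3" "$R1RECCOMP4" "$FTRIM_R1_STEP5" "$R1RECCOMP6"
#   concat_parts reverse.fastq.gz "$FTRIM_R2_STEP1" "$R2RECCOMP" \
#       "$FTRIM_R2_STEP3" "$R2RECCOMP4" "$FTRIM_R2_STEP5" "$R2RECCOMP6"
```

Passing the R1 and R2 parts in the same order keeps mates at the same position in both outputs, which is what paired-end tools such as cutadapt expect.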