Demultiplexed-seqs.qza is invalid

Hi everyone, I'm new to QIIME 2 (and to bioinformatics in general), so I've been following this protocol and these instructions for demultiplexing. We're working with paired-end Illumina sequences. After I imported and demultiplexed the data, qiime tools validate reported that the resulting demultiplexed-seqs.qza contains empty sequences. I'm not sure where or how the sequences got deleted, or how serious the underlying problem is.
Here's the error message:

Result demultiplexed-seqs.qza does not appear to be valid at level=max:

      /tmp/qiime2-archive-eqvu81tf/b28f664d-8b9e-4767-aa3f-45a6579f691c/data/PB_375_ATATCG_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

      Missing sequence for record beginning on line 49

Here are the exact commands I ran.
I started with the raw files from the sequencing center (the only change I made was renaming them to forward and reverse). The metadata file is attached: 4246A_Metadata.txt (386 Bytes)

source activate qiime2-2018.11

qiime tools import --type MultiplexedPairedEndBarcodeInSequence --input-path /bioinf/home/acastill/Bioinf_16S_Oct2019/import_4246A --output-path /bioinf/home/acastill/Bioinf_16S_Oct2019/Edits_4246A_Feb2020/multiplexed-seqs.qza

The resulting multiplexed-seqs.qza is a valid file. Then I demultiplexed:

qiime cutadapt demux-paired --i-seqs /bioinf/home/acastill/Bioinf_16S_Oct2019/Edits_4246A_Feb2020/multiplexed-seqs.qza --m-forward-barcodes-file /bioinf/home/acastill/Bioinf_16S_Oct2019/Edits_4246A_Feb2020/metadata_4246A.tsv --m-forward-barcodes-column Barcode --p-error-rate 0 --o-per-sample-sequences /bioinf/home/acastill/Bioinf_16S_Oct2019/Edits_4246A_Feb2020/demultiplexed-seqs.qza --o-untrimmed-sequences /bioinf/home/acastill/Bioinf_16S_Oct2019/Edits_4246A_Feb2020/untrimmed.qza

Later in the script I use standalone cutadapt to get rid of the empty sequences. They generally make up only a small percentage of all sequences (from 0% to 1-2% of sequences removed), but I'm nervous that ignoring the cause of the problem will have consequences for data interpretation.
Any help would be greatly appreciated!
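
For anyone who wants to check the proportion of empty reads in a single demultiplexed file, here is a minimal sketch using only gzip and awk. The function name `count_empty_reads` is made up for illustration, and the sketch assumes standard 4-line FASTQ records (header, sequence, "+", quality) with no wrapped sequence lines.

```shell
# Hypothetical helper (not part of QIIME 2 or cutadapt).
# Prints "<empty> <total>": the number of records with an empty sequence
# line and the total number of records in a gzipped FASTQ file.
# Assumes 4-line records: header, sequence, "+", quality.
count_empty_reads() {
  gzip -dc "$1" | awk '
    NR % 4 == 2 {                      # second line of each record = sequence
      total++
      if (length($0) == 0) empty++
    }
    END { printf "%d %d\n", empty + 0, total + 0 }
  '
}
```

Calling `count_empty_reads PB_375_ATATCG_L001_R1_001.fastq.gz` on a file extracted from the artifact would print the two counts, from which the percentage can be worked out.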

Welcome to the forum @Andrea_C!

Sounds like you've identified the source of the error, as well as a fix (using cutadapt to remove empty sequences). Let me see if I can answer your remaining concerns:

Sometimes sequencing cores/services will run preliminary QC to filter low-quality sequences. You could ask them whether they do this and what it involves, and maybe even get a rawer form of your data :crossed_fingers:

I am not 100% sure, but I don't think it will cause issues downstream if you just remove those sequences. At least, that seems to have been the case for others who have had similar issues.

So you can use cutadapt to remove those sequences. If there's an option to also remove the mates of these empty reads, that would be best, just to avoid possible hitches downstream.
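
Standalone cutadapt has `--minimum-length` and `--pair-filter` options that are meant for this kind of paired filtering. To illustrate the idea without depending on a particular cutadapt version, here is a rough sketch with standard Unix tools only. The function name `filter_empty_pairs` is hypothetical, and it assumes 4-line FASTQ records, reads in the same order in both files, and no tab characters in the headers.

```shell
# Hypothetical helper (not part of QIIME 2 or cutadapt).
# Usage: filter_empty_pairs in_R1.fastq.gz in_R2.fastq.gz out_R1.fastq.gz out_R2.fastq.gz
# Drops a read pair when either mate has an empty sequence line.
# Assumes 4-line records, matching read order, and no tabs in headers.
filter_empty_pairs() {
  tmpdir=$(mktemp -d) || return 1

  # Flatten each 4-line record onto one tab-separated row.
  gzip -dc "$1" | paste - - - - > "$tmpdir/r1.tsv"
  gzip -dc "$2" | paste - - - - > "$tmpdir/r2.tsv"

  # Join mates side by side: fields 1-4 are R1, fields 5-8 are R2.
  # Keep only rows where both sequence fields (2 and 6) are non-empty.
  paste "$tmpdir/r1.tsv" "$tmpdir/r2.tsv" |
    awk -F'\t' '$2 != "" && $6 != ""' > "$tmpdir/kept.tsv"

  # Split the kept rows back into two 4-line-per-record FASTQ files.
  cut -f1-4 "$tmpdir/kept.tsv" | tr '\t' '\n' | gzip -c > "$3"
  cut -f5-8 "$tmpdir/kept.tsv" | tr '\t' '\n' | gzip -c > "$4"

  rm -r "$tmpdir"
}
```

Because the mates are filtered together, the forward and reverse files stay in sync, which is what paired-end denoising expects.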

Give that a try and let us know what happens! If you are able to complete demultiplexing and denoising/clustering without error then you should be in the clear... it will have no downstream consequences beyond that stage.

Hi @Andrea_C,
Just a follow-up on something I didn't catch before: you're running qiime2-2018.11.

That's an ancient version of QIIME 2! The minimum sequence length setting in cutadapt was updated more recently to prevent 0 nt reads from winding up in the cutadapt results, so if you update to the latest version you should be able to demultiplex without issue, and without using standalone cutadapt to filter out those reads.

That worked like a charm, thank you!
