Fastq.gz and quality score length do not match using type: EMPPairedEndSequences from MiSeq

chelsea.brisson.423 · March 20, 2020, 5:52pm

I have three fastq.gz files from our MiSeq:
forward.fastq.gz
reverse.fastq.gz
barcodes.fastq.gz
They are in their own directory called EXP1_ToProcess on Qiime2 (I’m running version 2019.7 in conda on a mac), and I used the code:

qiime tools import --type EMPPairedEndSequences --input-path EXP1_ToProcess/ --output-path EXP1_paired_end_seqs

This has worked in the past, but now I get the error message:

There was a problem importing EXP1_ToProcess/:

EXP1_ToProcess/forward.fastq.gz is not a(n) FastqGzFormat file:

Quality score length doesn’t match sequence length for record beginning on line 43126365

From other threads, I gather that this is pretty far down in the file and might be hard to troubleshoot, but I was hoping someone had an idea of what’s going on! We have used these exact files in Qiime to process, but would like to process in Qiime2 now.
Also, I tried running the code with --verbose, but it just gives the error “no such option: --verbose”

Thank you!

thermokarst · March 24, 2020, 2:35pm

Hi @chelsea.brisson.423!

If I had to guess, the file forward.fastq.gz wasn't completely transferred when moving it to the mac that you're running QIIME 2 on. This can happen - network errors sometimes cause files to look like they have completely transferred, when in reality part of the file is missing. I think that is the case because of the specific error message. As you pointed out, the error is down near the bottom of the file (which represents the last part of the file transferred). The error message is also complaining that the quality scores in a record are shorter than the sequence for the same record:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%

That can happen if the network connection died or encountered an issue - a partially transferred file and record.
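To locate a record like that without opening the whole file, a minimal sketch along these lines might help. It assumes standard four-line FASTQ records (no wrapped sequence lines), and `find_length_mismatch` is a hypothetical helper name, not part of QIIME 2:

```shell
# Hypothetical helper: report the first record whose quality string
# length differs from its sequence length (assumes 4-line FASTQ records).
find_length_mismatch() {
  gunzip -c "$1" | awk '
    NR % 4 == 2 { seq_len = length($0) }            # sequence line
    NR % 4 == 0 && length($0) != seq_len {          # quality line
      printf "line %d: sequence length %d, quality length %d\n", NR - 3, seq_len, length($0)
      bad = 1
      exit
    }
    END {
      if (!bad && NR % 4 != 0) printf "file ends mid-record after line %d\n", NR
      else if (!bad) print "no length mismatches found"
    }'
}

# Example usage (file name taken from the post above):
# find_length_mismatch EXP1_ToProcess/forward.fastq.gz
```

The reported line number is the header line of the offending record, which should correspond to the "record beginning on line ..." wording in the import error. If the file is also cut off at the gzip level, gunzip will typically print its own "unexpected end of file" complaint as well.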

Double check that you have the complete files (md5 checksums can help with that, or just measuring the file size).
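One way to run those checks from a shell is sketched below. The helper names `check_gzip_integrity` and `file_md5` are made up for illustration; `gzip -t` tests the compressed stream, and the checksum helper falls back to the macOS `md5` command when `md5sum` is not installed:

```shell
# Hypothetical helper: test each gzip file; a truncated transfer
# usually fails this test.
check_gzip_integrity() {
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "OK $f"
    else
      echo "CORRUPT $f"
    fi
  done
}

# Hypothetical helper: print only the MD5 checksum of a file
# (md5sum on Linux, md5 -q on macOS).
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{ print $1 }'
  else
    md5 -q "$1"
  fi
}

# Example usage (directory name taken from the post above):
# check_gzip_integrity EXP1_ToProcess/*.fastq.gz
# file_md5 EXP1_ToProcess/forward.fastq.gz
```

The checksum is only useful when compared against one computed on the machine the files came from; a matching value on both sides suggests the copy is complete.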

Keep us posted!

chelsea.brisson.423 · March 31, 2020, 6:08pm

Thank you so much! I think this worked, although I’m running into another problem in the next step when I try to demultiplex the reads that makes me think it didn’t actually work…
I am using this code:
qiime demux emp-paired \
  --i-seqs paired_end_seqs.qza \
  --m-barcodes-file MapingFile.txt \
  --m-barcodes-column BarcodeSequence \
  --o-per-sample-sequences demux \
  --o-error-correction-details error_details \
  --output-dir demux_dir \
  --p-rev-comp-mapping-barcodes

and the error message I get is:
Plugin error from demux:

*** Mismatched sequence descriptions: N:0:0, N:0:NGAGTTGTAGCGA, and N:0:NGAGTTGTAGCGA***

Debug info has been saved to /tmp/qiime2-q2cli-err-n3za1s2l.log

I found another thread where N:0:1 did not match and they redownloaded the files and everything worked, but we do not have the original files from the MiSeq, and my issue starts with N:0:0 (Plugin error from demux: Mismatched sequence descriptions). My fastq.gz files still work with Qiime1, but crossing over to Qiime2 seems to be a huge issue…

thermokarst · April 1, 2020, 2:48pm

Personally, I wouldn't trust the QIIME 1 results in this case - I wasn't involved with the QIIME 1 project, but I suspect that QIIME 1 just wasn't performing the same level of validation of the sequences that QIIME 2 is (though I could very well be wrong).

Has some kind of pre-processing been applied to these data? Can you talk a little bit more about the upstream processing, if any?

chelsea.brisson.423 · April 1, 2020, 11:21pm

@thermokarst Thanks for your reply!
Sure! We don’t want to use the Qiime1 results either, which is why we are trying to run the original fastq.gz files from the MiSeq through Qiime2, getting ASVs instead of OTUs.
As far as I know, nothing upstream has been done so far besides what was mentioned in this thread. The fastq.gz files we are using are from the MiSeq output. We are not trying to convert Qiime1 output to Qiime2 - we are trying to altogether avoid Qiime1.

thermokarst · April 2, 2020, 2:24pm

Hi @chelsea.brisson.423 - I'm not too sure what else to tell you here - there appears to be an issue with these data (or our understanding of their nature) - if they were prepared using the EMP protocol (wet lab and sequencing programming) the forward, reverse, and barcode reads should all be in the same "read" order, and should all have the same number of reads. We can try assessing the read counts, but it won't tell us what we don't already know:

for f in *.fastq.gz; do r=$(( $(gunzip -c "$f" | wc -l | tr -d '[:space:]') / 4 )); echo "$r $f"; done
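Beyond the counts, it may also be worth looking at the header lines themselves, since the demux error is about mismatched sequence descriptions. The following is only a sketch: `first_header`, `read_ids`, and `ids_match` are hypothetical helper names, and it assumes four-line FASTQ records whose header has the read ID as the first whitespace-delimited token:

```shell
# Hypothetical helper: print the first header line of a FASTQ file so
# the description fields can be compared by eye across files.
first_header() {
  gunzip -c "$1" | head -n 1
}

# Hypothetical helper: print the read ID (first token of each header line).
read_ids() {
  gunzip -c "$1" | awk 'NR % 4 == 1 { print $1 }'
}

# Hypothetical helper: check that two files list the same read IDs
# in the same order.
ids_match() {
  a=$(mktemp)
  b=$(mktemp)
  read_ids "$1" > "$a"
  read_ids "$2" > "$b"
  if cmp -s "$a" "$b"; then echo "match"; else echo "differ"; fi
  rm -f "$a" "$b"
}

# Example usage (file names taken from the first post):
# for f in forward reverse barcodes; do first_header "$f.fastq.gz"; done
# ids_match forward.fastq.gz barcodes.fastq.gz
```

If the barcode file's headers end in something like N:0:0 while the forward and reverse headers end in a barcode sequence, that would be consistent with the error message shared above, though it would not by itself explain how the files came to differ.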

Are you able to consult with whoever did the original sequencing, and learn what the software protocol used was? It sounds like maybe it wasn't actually EMP...

Keep us posted.

chelsea.brisson.423 · April 11, 2020, 12:25am

Hi @thermokarst - thanks for the reply! The original protocol was EMP. We ended up reverse complementing the barcodes manually and that worked. Still not sure why the dataset worked in Qiime but not Qiime2!
Thanks for the help!

thermokarst · April 13, 2020, 2:17pm

Hmm, the errors you shared above don't really have anything to do with the orientation of the barcodes. For anyone else who might come across this topic, my hypothesis is that there was a file mixup somewhere (maybe a partial transfer), and this process of RCing helped get everything situated. I don't think that reverse complementing would have anything to do with either of the errors posted above, though, so for those following along please don't just RC your reads because you read about it here. Luckily @chelsea.brisson.423 got two birds with one stone here, because they almost certainly would've needed to RC their reads, anyway - sounds like it all got sorted out in one shot.