FASTQ quality scores & sequences have two different lengths

This was an issue that arose out of another problem I had, so I figured it would be best to create another ticket for this problem.

As the title states, I have FASTQ files that seem to have an extra character in the quality score line vs the sequence line. When I try to import my manifest file I get this warning:

There was a problem importing seqs.tsv:

/var/folders/zc/csj0fb595j98l9vn8xybjdr40000gp/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-lp2cavnw/LCl-85_212_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

Quality score length doesn’t match sequence length for record beginning on line 5.

So, I checked the file manually & there was one extra character in the quality score (e.g. 316) vs the sequence (e.g. 315). It seems that this is an issue for some. This person had the same issue due to a joining of FASTA & quality scores with a converter. That got me thinking about the way Windows & Mac (or Unix) encode their line breaks/endings, as has been mentioned to me before.

Using BBEdit, I have gone through every fastq file :persevere: in the folder holding my sequences & switched the line break types to Mac (CR) & made sure that each seq/fastq file had only four lines (all of them had an extra space that caused the file to have 5 lines, though only four lines had any data/info). I’m thankful I only have ~425 sequences.

Having done all of that, I am still getting this error. I have checked the number of characters in both the quality scores as well as the sequences themselves & they have an identical number of characters. So, I’m really not sure why this is still an issue. Would it help at all to change the line break type to Unix (LF)? I can’t imagine that would be the case, but I’m completely lost on this issue.

Thanks for your patience & wisdom!
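
For anyone who wants to check this without counting characters by hand, here is a minimal Python sketch. It is not QIIME 2's own validator, and the function name `check_fastq` is made up for illustration. It assumes plain four-line FASTQ records, splits only on `\n`, and strips nothing, so a trailing space or a stray `\r` counts towards a line's length. It also returns any leftover lines that do not form a complete record, such as a stray fifth line.

```python
def check_fastq(data: bytes):
    """Compare sequence and quality lengths in four-line FASTQ records.

    Returns (mismatches, leftover):
      mismatches: list of (first_line_number, seq_len, qual_len)
      leftover:   trailing lines that do not make up a full record
    """
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a final newline

    complete = len(lines) - len(lines) % 4  # lines covered by whole records
    mismatches = []
    for i in range(0, complete, 4):
        seq, qual = lines[i + 1], lines[i + 3]
        # Nothing is stripped, so invisible characters are counted too.
        if len(seq) != len(qual):
            mismatches.append((i + 1, len(seq), len(qual)))

    leftover = lines[complete:]
    return mismatches, leftover
```

To run it on a real file, read the bytes first, e.g. `check_fastq(open("LCl-85_R1.fastq", "rb").read())`.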

Good morning @jhines1,

Sorry you are having this issue.

Yes! Linux software, like QIIME 2, expects Unix (\n) line endings. Hopefully this will immediately solve your issue now that you removed that extra symbol.

Let us know what you try,
Colin
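
On the line-ending point, a quick way to see which endings a file actually contains is to count them in the raw bytes. The sketch below is only an illustration; `count_line_endings` and `to_unix` are hypothetical helper names, not part of QIIME 2.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF), old Mac (lone CR) and Unix (lone LF) endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF
        "LF": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF
    }


def to_unix(data: bytes) -> bytes:
    """Convert CRLF and lone CR endings to LF."""
    # Replace CRLF first so it does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

A file that is already in Unix format should report zero for both `CRLF` and `CR`.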

I changed all of the line breaks to Unix (LF) format but I am still getting the same error. It’s weird because it doesn’t throw that error with the first sequence in the file. Rather, it seems to be random. I attempted to remove the files that were called out in each error from the data, but it still has issues with certain files. One error would be mid-way down the file list, then the next error would be near the top of the list…the next near the bottom. I’m not sure what else to do.

Is this the error you are still getting? If so, the Unix line endings might be fixed, but the fastq file could still be strange.

Spooky! :ghost:
Is it a truly random and different read each time you rerun on a single file? Or does it always stop at the same read in a specific file?

Colin

Yes, it is the same error. Yesterday, I tried to import and it would throw this same error at a specific file. When I removed the file(s) from the folder it would throw an error on another file. Once I put the file(s) back in the folder it would throw the error on that file again. Having said that, I made a list of files it doesn’t like, up to five. The newest error message (after I converted everything to an LF break) is for a file that wasn’t on the list yesterday.

Does the error tell you what line of the file gives you issues? Can you post the full error for me?

There was a problem importing seqs.tsv:

/var/folders/zc/csh0fb595j98l9vn8xybjdr40000gp/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-1m3ec9t2/HCO3-13_90_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

Quality score length doesn’t match sequence length for record beginning on line 5

The thing is, I don’t have fastq.gz files, I have fastq files. I am trying to import my manifest file with the instructions for “Fastq Manifest Formats” found here.

Here is the code I used for this import:

qiime tools import \
  --type SampleData[SequencesWithQuality] \
  --input-path seqs.tsv \
  --output-path seqs.qza \
  --input-format SingleEndFastqManifestPhred33V2

It should be noted that I have forward-only reads from Sanger sequencing, so they are not EMP or Casava formatted & the offset is 33 according to this page that was posted on the Importing Data tutorial.

Hi @jhines1!

Yep, the manifest format will gzip any unzipped files for you.

I’m not convinced this has anything to do with line-endings (yet). Can you please provide the first few lines of the file with sample IDs LCl-85 and HCO3? You can run head $FILENAME to get that info. Thanks!

I will post this shortly. I am not able to connect to wifi with my laptop (where I’m using q2) & I fear trying to type everything out will take WAY too much time & is error prone. I will post the results when I get back into wifi range.

However, I can give you the actual header instead of the entire sequence, for now:

head HCO3-13_R1.fastq

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGG…
+
NNUXXUUID…

Thanks! Will take a closer look when you post the complete output from the head command (it's actually the stuff here that I am looking for!):

Keep us posted!

Thanks for your patience. Here are the callouts:

head HCO3-13_R1.fastq

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCATCAAGCTCACGCTTGGTCTTGGGGCCTGCGGTTTCGCAGCCTCTAAACTCAGTGGCGGTGCGATTGAGCTCTGAGCGTAGTAATTTTTCTCGCTATAGGGTCTCGGTCGTGACTTGCCAGTAACCCCCAATTTTTATCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGA

‘+’ <-- I added the quote marks to get the plus sign to show

NNUUXXUUIDDNOIIIUNNNNXNUXXXXXUUNNUXXUUUXUUU88UUUUXXXXUIINXXUUUUUDUXXXUUXXUUUXXXUNNXUUUUNNUNNNUUXUU::INUUUXXUUUXUXXNUNUXUNNUX==UUUXUNNUXUUUUXXXUUUUNUUNUUNUU;;NXXXUNUXXXXUNNUUXXXXUUUNNUUUUXXXUNNNUNUUXXXUNNUUUUUNUNUNUNNNUUUUUUUUXXXXXNNUUUUUUUNUXXXUUUXXXXUUNNXUUUUUUUUUXXXXXXXXXXXUUNUUXXXUUUUUUUUUNNXXXUUUXXUUUUNUXUUUUNN

head LCl-85_R1.fastq

@LCl-85_M13F_Plate_DNA_00002320_D03.ab1 extraction

TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGGTGGTATTCCACCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAAGCCTGGCTTTGGTGTTGGAGGGATACCTGTAAAAGGGTACCCTCTGAAATTTAGTGGCGGGCTCGCTAGAATTTTGAGCGTAGTAGTTTTACCTCGTTTTTAAAGACTAGTGGGACTTCTTGCCGTAAAACCCCCCAACTTTCTGAAAATTGACCTCGGATCAGGTAGGAATACCCGCTGA

‘+’ <-- again with the quotes

OPYY_7<>JOAA<SAHTTKYTS8GGG<.IY_YYTKYKYYJAGGYLHHN\Y\TML\YTNTTHH?TNLONJT3OLLR9HLJOOSIILCTCCFITT\\OLCOONLT\JT?SL\TT\LRLYILRLS?LRSCRRYRRT\YTY_OYRTSQY_YLTSCASLTTRTTLT\TTY\T\Y_TLTS\\\\Y_WNY9?HL\YLTTRRLILW\YTYSSSTROL\YSYTRYIY\R\\\\W_T\ITTYYY__S\RCRRROY\\RY_\Y\_\\WY___WWLLQSNESW-88RT_RRYYW_Y\LYYYEWQCQNC=CSOYOWRWYCL\

Can you try pasting again, this time in a code fence:

```
your pasted output here
```

That way you don’t have to do all kinds of funky formatting.

Sure, sorry about that.

 head HCO3-13_R1.fastq

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCATCAAGCTCACGCTTGGTCTTGGGGCCTGCGGTTTCGCAGCCTCTAAACTCAGTGGCGGTGCGATTGAGCTCTGAGCGTAGTAATTTTTCTCGCTATAGGGTCTCGGTCGTGACTTGCCAGTAACCCCCAATTTTTATCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGA

+ 

NNUUXXUUIDDNOIIIUNNNNXNUXXXXXUUNNUXXUUUXUUU88UUUUXXXXUIINXXUUUUUDUXXXUUXXUUUXXXUNNXUUUUNNUNNNUUXUU::INUUUXXUUUXUXXNUNUXUNNUX==UUUXUNNUXUUUUXXXUUUUNUUNUUNUU;;NXXXUNUXXXXUNNUUXXXXUUUNNUUUUXXXUNNNUNUUXXXUNNUUUUUNUNUNUNNNUUUUUUUUXXXXXNNUUUUUUUNUXXXUUUXXXXUUNNXUUUUUUUUUXXXXXXXXXXXUUNUUXXXUUUUUUUUUNNXXXUUUXXUUUUNUXUUUUNN


head LCl-85_R1.fastq

@LCl-85_M13F_Plate_DNA_00002320_D03.ab1 extraction

TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGGTGGTATTCCACCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAAGCCTGGCTTTGGTGTTGGAGGGATACCTGTAAAAGGGTACCCTCTGAAATTTAGTGGCGGGCTCGCTAGAATTTTGAGCGTAGTAGTTTTACCTCGTTTTTAAAGACTAGTGGGACTTCTTGCCGTAAAACCCCCCAACTTTCTGAAAATTGACCTCGGATCAGGTAGGAATACCCGCTGA

+

OPYY_7<>JOAA<SAHTTKYTS8GGG<.IY_YYTKYKYYJAGGYLHHN\\Y\TML\YTNTTHH?TNLONJT3OLLR9HLJOOSIILCTCCFITT\\\OLCOONLT\JT?SL\TT\LRLYILRLS?LRSCRRYRRT\YTY\_OYRTSQY_YLTSCASLTTRTTLT\TTY\T\Y_TLTS\\_\\\\\Y_WNY9?HL\YLTTRRLILW\YTYSSSTROL_\\YSYTRYIY\R\\\\\\\\W_T\ITTYYY_\_S\RCRRROY\\\RY_\Y\\_\\\WY\___WWLLQSNESW-88RT_RRYYW\_Y\LYYYEWQCQNC=CSOYOWRWYCL\\

Hmm, something still isn’t quite right - where are all of these line breaks coming from? Is this really what the results of running head HCO3-13_R1.fastq looked like? Or was it like this:

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCATCAAGCTCACGCTTGGTCTTGGGGCCTGCGGTTTCGCAGCCTCTAAACTCAGTGGCGGTGCGATTGAGCTCTGAGCGTAGTAATTTTTCTCGCTATAGGGTCTCGGTCGTGACTTGCCAGTAACCCCCAATTTTTATCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGA
+ 
NNUUXXUUIDDNOIIIUNNNNXNUXXXXXUUNNUXXUUUXUUU88UUUUXXXXUIINXXUUUUUDUXXXUUXXUUUXXXUNNXUUUUNNUNNNUUXUU::INUUUXXUUUXUXXNUNUXUNNUX==UUUXUNNUXUUUUXXXUUUUNUUNUUNUU;;NXXXUNUXXXXUNNUUXXXXUUUNNUUUUXXXUNNNUNUUXXXUNNUUUUUNUNUNUNNNUUUUUUUUXXXXXNNUUUUUUUNUXXXUUUXXXXUUNNXUUUUUUUUUXXXXXXXXXXXUUNUUXXXUUUUUUUUUNNXXXUUUXXUUUUNUXUUUUNN

Also, is this representative of before or after using BBEdit to change the line breaks? If after, can you send a head sample of the before?

That was a direct copypasta from running head samplename.fastq. So if that is how q2 is reading it, how would I correct it? What is going on under the hood that makes it read like that?

This was after loading into BBEdit & changing everything to LF breaks.

Sorry, just now saw the other part of the question. Here is the “before” seq.

@LCl-85_M13F_Plate_DNA_00002320_D03.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGGTGGTATTCCACCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAAGCCTGGCTTTGGTGTTGGAGGGATACCTGTAAAAGGGTACCCTCTGAAATTTAGTGGCGGGCTCGCTAGAATTTTGAGCGTAGTAGTTTTACCTCGTTTTTAAAGACTAGTGGGACTTCTTGCCGTAAAACCCCCCAACTTTCTGAAAATTGACCTCGGATCAGGTAGGAATACCCGCTGA
+
OPYY_7<>JOAA<SAHTTKYTS8GGG<.IY_YYTKYKYYJAGGYLHHN\\Y\TML\YTNTTHH?TNLONJT3OLLR9HLJOOSIILCTCCFITT\\\OLCOONLT\JT?SL\TT\LRLYILRLS?LRSCRRYRRT\YTY\_OYRTSQY_YLTSCASLTTRTTLT\TTY\T\Y_TLTS\\_\\\\\Y_WNY9?HL\YLTTRRLILW\YTYSSSTROL_\\YSYTRYIY\R\\\\\\\\W_T\ITTYYY_\_S\RCRRROY\\\RY_\Y\\_\\\WY\___WWLLQSNESW-88RT_RRYYW\_Y\LYYYEWQCQNC=CSOYOWRWYCL\\

Okay, let’s scrap whatever you did in BBEdit, that doesn’t seem to be helping you at all. So, when you run on the “original files,” you mentioned this bit:

What was the extra character you found?

No idea. Everything looked the same, but when running a character count the quality score had one extra character compared to the sequence. I’m not sure if it is a space at the end (was more apparent in BBEdit) or if it was an actual extra character somewhere.
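
One way to find out whether the extra character is trailing whitespace is to look at the raw bytes at the end of each line. A rough sketch, assuming the file is small enough to read into memory; `trailing_whitespace` is a hypothetical helper name:

```python
def trailing_whitespace(data: bytes):
    """Return (line_number, trailing_bytes) for lines ending in space, tab or CR."""
    found = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        stripped = line.rstrip(b" \t\r")
        if stripped != line:
            # Keep the exact bytes that were stripped so they can be inspected.
            found.append((number, line[len(stripped):]))
    return found
```

A result such as `(4, b' ')` would point to a trailing space on the quality line. An empty list means no line ends in a space, tab or carriage return, so the extra character would have to be somewhere else.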