FASTQ quality scores & sequences have two different lengths

jhines1 · September 11, 2019, 12:45pm

This was an issue that arose out of another problem I had, so I figured it would be best to create another ticket for this problem.

As the title states, I have FASTQ files that seem to have an extra character in the quality score line vs the sequence line. When I try to import my manifest file I get this error:

There was a problem importing seqs.tsv:

/var/folders/zc/csj0fb595j98l9vn8xybjdr40000gp/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-lp2cavnw/LCl-85_212_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

Quality score length doesn't match sequence length for record beginning on line 5.

So, I checked the file manually & there was one extra character in the quality score (e.g. 316) vs the sequence (e.g. 315). It seems that this is an issue for some. This person had the same issue due to a joining of FASTA & quality scores with a converter. That got me to thinking about the way Windows & Mac (or Unix) code their line breaks/endings, as has been mentioned to me before.

Using BBEdit, I have gone through every fastq file in the folder holding my sequences & switched the line break types to Mac (CR) & made sure that each seq/fastq file had only four lines (all of them had an extra space that caused the file to have 5 lines, though only four lines had any data/info). I'm thankful I only have ~425 sequences.

Having done all of that, I am still getting this error. I have checked the number of characters in both the quality scores as well as the sequences themselves & they have an identical number of characters. So, I'm really not sure why this is still an issue. Would it help at all to change the line break type to Unix (LF)? I can't imagine that would be the case, but I'm completely lost on this issue.
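
For anyone following along, here is a minimal sketch of the length check described above, done per record instead of by eye. It assumes strict four-line records (header, sequence, `+`, quality); the function name `mismatched_records` and the two example reads are hypothetical, not taken from the files in this thread. Only the trailing LF is stripped, so a stray space or carriage return still counts as a character.

```python
from typing import Iterable, Iterator, Tuple


def mismatched_records(lines: Iterable[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (start_line, seq_len, qual_len) for records whose lengths differ.

    Assumes strict four-line FASTQ records. Only the trailing newline is
    removed, so invisible characters such as a space or "\r" are counted.
    """
    record = []
    for line_no, line in enumerate(lines, start=1):
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            _header, seq, _plus, qual = record
            if len(seq) != len(qual):
                # line_no is the quality line; the record began 3 lines earlier
                yield (line_no - 3, len(seq), len(qual))
            record = []


# Hypothetical example: the second quality line ends with a space.
example = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "ACGT\n", "+\n", "IIII \n",
]
for start, seq_len, qual_len in mismatched_records(example):
    print(f"record at line {start}: sequence={seq_len}, quality={qual_len}")
# prints: record at line 5: sequence=4, quality=5
```

Any iterable of lines works as input, including an open file handle.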

Thanks for your patience & wisdom!

colinbrislawn · September 11, 2019, 1:18pm

Good morning @jhines1,

Sorry you are having this issue.

Yes! Linux software, like QIIME 2, expects Unix (\n) line endings. Hopefully this will immediately solve your issue now that you removed that extra symbol.
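
A rough sketch of how the line-ending style could be checked and converted without a text editor. It assumes the file is small enough to read into memory as bytes; `count_line_endings` and `to_unix` are hypothetical helper names, and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows), lone CR (classic Mac), and lone LF (Unix) endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # carriage returns not followed by LF
        "LF": data.count(b"\n") - crlf,  # line feeds not preceded by CR
    }


def to_unix(data: bytes) -> bytes:
    """Convert CRLF and lone CR endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up record saved with classic Mac (CR) line endings.
sample = b"@read1\rACGT\r+\rIIII\r"
print(count_line_endings(sample))           # {'CRLF': 0, 'CR': 4, 'LF': 0}
print(count_line_endings(to_unix(sample)))  # {'CRLF': 0, 'CR': 0, 'LF': 4}
```

For a real file, the bytes could come from `pathlib.Path(name).read_bytes()`, with the converted result written to a new file so the original stays untouched.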

Let us know what you try,
Colin

jhines1 · September 11, 2019, 2:11pm

I changed all of the line breaks to Unix (LF) format but I am still getting the same error. It's weird because it doesn't throw that error with the first sequence in the file. Rather, it seems to be random. I attempted to remove the files that were called out in each error from the data, but it still has issues with certain files. One error would be mid-way down the file list, then the next error would be near the top of the list...the next near the bottom. I'm not sure what else to do.

colinbrislawn · September 11, 2019, 3:18pm

Is this the error you are still getting? If so, the Unix line endings might be fixed, but the fastq file could still be strange.

Spooky!
Is it a truly random and different read each time you rerun on a single file? Or does it always stop at the same read in a specific file?

Colin

jhines1 · September 11, 2019, 3:23pm

Yes, it is the same error. Yesterday, I tried to import and it would throw this same error at a specific file. When I removed the file(s) from the folder it would throw an error on another file. Once I put the file(s) back in the folder it would throw the error on that file again. Having said that, I made a list of files it doesn't like, up to five. The newest error message (after I converted everything to an LF break) is for a file that wasn't on the list yesterday.

colinbrislawn · September 11, 2019, 3:37pm

Does the error tell you what line of the file gives you issues? Can you post the full error for me?

jhines1 · September 11, 2019, 3:48pm

There was a problem importing seqs.tsv:

/var/folders/zc/csh0fb595j98l9vn8xybjdr40000gp/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-1m3ec9t2/HCO3-13_90_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

Quality score length doesn't match sequence length for record beginning on line 5

The thing is, I don't have fastq.gz files, I have fastq files. I am trying to import my manifest file with the instructions for "Fastq Manifest Formats" found here.

Here is the code I used for this import:

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path seqs.tsv \
  --output-path seqs.qza \
  --input-format SingleEndFastqManifestPhred33V2

It should be noted that I have forward-only reads from Sanger sequencing, so they are not EMP or Cassava formatted & the offset is 33 according to this page that was posted on the Importing Data tutorial.
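
To illustrate what an offset of 33 means: each quality character's ASCII code minus 33 gives the Phred score. The sketch below is only an illustration of that arithmetic; `phred_scores` is a hypothetical helper name, and the input is the first six quality characters of the HCO3-13 read as posted later in this thread.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


# "N" is ASCII 78, "U" is 85, "X" is 88; subtracting 33 gives 45, 52, 55.
print(phred_scores("NNUUXX"))  # [45, 45, 52, 52, 55, 55]
```

Because every sequence base needs exactly one score, one extra character in the quality line (visible or not) is enough to trigger the length mismatch error.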

thermokarst · September 11, 2019, 5:11pm

Hi @jhines1!

Yep, the manifest format will gzip any unzipped files for you.

I'm not convinced this has anything to do with line-endings (yet). Can you please provide the first few lines of the file with sample IDs LCl-85 and HCO3? You can run head $FILENAME to get that info. Thanks!
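
Since `head` prints whitespace and line endings invisibly, a small Python sketch like the one below can show the same first lines with every byte spelled out. This is a suggestion for readers, not part of the request above; `show_raw_lines` is a hypothetical name and the demo bytes are made up (they include a space after the `+` and a Windows line ending on purpose).

```python
import io
from itertools import islice


def show_raw_lines(handle, n: int = 8) -> list:
    """Return repr() of the first n lines so spaces, "\r" and "\n" are visible."""
    return [repr(line) for line in islice(handle, n)]


# In-memory stand-in for a file opened in binary mode, e.g. open(name, "rb").
demo = io.BytesIO(b"@read1\nACGT\n+ \nIIII\r\n")
for text in show_raw_lines(demo):
    print(text)
# prints:
# b'@read1\n'
# b'ACGT\n'
# b'+ \n'
# b'IIII\r\n'
```

Opening the file in binary mode matters here, because text mode may translate line endings before they can be inspected.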

jhines1 · September 11, 2019, 8:07pm

I will post this shortly. I am not able to connect to wifi with my laptop (where I'm using q2) & I fear trying to type everything out will take WAY too much time & is error prone. I will post the results when I get back into wifi range.

However, I can give you the actual header instead of the entire sequence, for now:

head HCO3-13_R1.fastq

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGG...
+
NNUXXUUID...

thermokarst · September 11, 2019, 8:37pm

Thanks! Will take a closer look when you post the complete output from the head command (it's actually the stuff here that I am looking for!):

Keep us posted!

jhines1 · September 11, 2019, 9:42pm

Thanks for your patience. Here are the callouts:

head HCO3-13_R1.fastq

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCATCAAGCTCACGCTTGGTCTTGGGGCCTGCGGTTTCGCAGCCTCTAAACTCAGTGGCGGTGCGATTGAGCTCTGAGCGTAGTAATTTTTCTCGCTATAGGGTCTCGGTCGTGACTTGCCAGTAACCCCCAATTTTTATCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGA

'+' <-- I added the quote marks to get the plus sign to show

NNUUXXUUIDDNOIIIUNNNNXNUXXXXXUUNNUXXUUUXUUU88UUUUXXXXUIINXXUUUUUDUXXXUUXXUUUXXXUNNXUUUUNNUNNNUUXUU::INUUUXXUUUXUXXNUNUXUNNUX==UUUXUNNUXUUUUXXXUUUUNUUNUUNUU;;NXXXUNUXXXXUNNUUXXXXUUUNNUUUUXXXUNNNUNUUXXXUNNUUUUUNUNUNUNNNUUUUUUUUXXXXXNNUUUUUUUNUXXXUUUXXXXUUNNXUUUUUUUUUXXXXXXXXXXXUUNUUXXXUUUUUUUUUNNXXXUUUXXUUUUNUXUUUUNN

head LCl-85_R1.fastq

@LCl-85_M13F_Plate_DNA_00002320_D03.ab1 extraction

TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGGTGGTATTCCACCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAAGCCTGGCTTTGGTGTTGGAGGGATACCTGTAAAAGGGTACCCTCTGAAATTTAGTGGCGGGCTCGCTAGAATTTTGAGCGTAGTAGTTTTACCTCGTTTTTAAAGACTAGTGGGACTTCTTGCCGTAAAACCCCCCAACTTTCTGAAAATTGACCTCGGATCAGGTAGGAATACCCGCTGA

'+' <-- again with the quotes

OPYY_7<>JOAA<SAHTTKYTS8GGG<.IY_YYTKYKYYJAGGYLHHN\Y\TML\YTNTTHH?TNLONJT3OLLR9HLJOOSIILCTCCFITT\\OLCOONLT\JT?SL\TT\LRLYILRLS?LRSCRRYRRT\YTY_OYRTSQY_YLTSCASLTTRTTLT\TTY\T\Y_TLTS\\\\Y_WNY9?HL\YLTTRRLILW\YTYSSSTROL\YSYTRYIY\R\\\\W_T\ITTYYY__S\RCRRROY\\RY_\Y\_\\WY___WWLLQSNESW-88RT_RRYYW_Y\LYYYEWQCQNC=CSOYOWRWYCL\

thermokarst · September 11, 2019, 10:04pm

Can you try pasting again, this time in a code fence:

```
your pasted output here
```

That way you don't have to do all kinds of funky formatting.

jhines1 · September 11, 2019, 10:17pm

Sure, sorry about that.

 head HCO3-13_R1.fastq

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCATCAAGCTCACGCTTGGTCTTGGGGCCTGCGGTTTCGCAGCCTCTAAACTCAGTGGCGGTGCGATTGAGCTCTGAGCGTAGTAATTTTTCTCGCTATAGGGTCTCGGTCGTGACTTGCCAGTAACCCCCAATTTTTATCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGA

+ 

NNUUXXUUIDDNOIIIUNNNNXNUXXXXXUUNNUXXUUUXUUU88UUUUXXXXUIINXXUUUUUDUXXXUUXXUUUXXXUNNXUUUUNNUNNNUUXUU::INUUUXXUUUXUXXNUNUXUNNUX==UUUXUNNUXUUUUXXXUUUUNUUNUUNUU;;NXXXUNUXXXXUNNUUXXXXUUUNNUUUUXXXUNNNUNUUXXXUNNUUUUUNUNUNUNNNUUUUUUUUXXXXXNNUUUUUUUNUXXXUUUXXXXUUNNXUUUUUUUUUXXXXXXXXXXXUUNUUXXXUUUUUUUUUNNXXXUUUXXUUUUNUXUUUUNN


head LCl-85_R1.fastq

@LCl-85_M13F_Plate_DNA_00002320_D03.ab1 extraction

TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGGTGGTATTCCACCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAAGCCTGGCTTTGGTGTTGGAGGGATACCTGTAAAAGGGTACCCTCTGAAATTTAGTGGCGGGCTCGCTAGAATTTTGAGCGTAGTAGTTTTACCTCGTTTTTAAAGACTAGTGGGACTTCTTGCCGTAAAACCCCCCAACTTTCTGAAAATTGACCTCGGATCAGGTAGGAATACCCGCTGA

+

OPYY_7<>JOAA<SAHTTKYTS8GGG<.IY_YYTKYKYYJAGGYLHHN\\Y\TML\YTNTTHH?TNLONJT3OLLR9HLJOOSIILCTCCFITT\\\OLCOONLT\JT?SL\TT\LRLYILRLS?LRSCRRYRRT\YTY\_OYRTSQY_YLTSCASLTTRTTLT\TTY\T\Y_TLTS\\_\\\\\Y_WNY9?HL\YLTTRRLILW\YTYSSSTROL_\\YSYTRYIY\R\\\\\\\\W_T\ITTYYY_\_S\RCRRROY\\\RY_\Y\\_\\\WY\___WWLLQSNESW-88RT_RRYYW\_Y\LYYYEWQCQNC=CSOYOWRWYCL\\

thermokarst · September 11, 2019, 10:26pm

Hmm, something still isn't quite right - where are all of these line breaks coming from? Is this really what the results of running head HCO3-13_R1.fastq looked like? Or was it like this:

@HCO3-13_M13F_Plate_DNA_00002319_A11.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCATCAAGCTCACGCTTGGTCTTGGGGCCTGCGGTTTCGCAGCCTCTAAACTCAGTGGCGGTGCGATTGAGCTCTGAGCGTAGTAATTTTTCTCGCTATAGGGTCTCGGTCGTGACTTGCCAGTAACCCCCAATTTTTATCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGA
+ 
NNUUXXUUIDDNOIIIUNNNNXNUXXXXXUUNNUXXUUUXUUU88UUUUXXXXUIINXXUUUUUDUXXXUUXXUUUXXXUNNXUUUUNNUNNNUUXUU::INUUUXXUUUXUXXNUNUXUNNUX==UUUXUNNUXUUUUXXXUUUUNUUNUUNUU;;NXXXUNUXXXXUNNUUXXXXUUUNNUUUUXXXUNNNUNUUXXXUNNUUUUUNUNUNUNNNUUUUUUUUXXXXXNNUUUUUUUNUXXXUUUXXXXUUNNXUUUUUUUUUXXXXXXXXXXXUUNUUXXXUUUUUUUUUNNXXXUUUXXUUUUNUXUUUUNN
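
If the blank lines and trailing spaces really are in the files (which is not yet confirmed at this point in the thread), one possible cleanup is sketched below. It assumes four-line records and relies on the fact that whitespace is never a valid quality character; `tidy_fastq_lines` and `looks_like_four_line_fastq` are hypothetical names, and the messy example is made up.

```python
def tidy_fastq_lines(lines) -> list:
    """Drop blank lines and strip trailing whitespace (spaces, tabs, CR, LF)."""
    cleaned = []
    for line in lines:
        stripped = line.rstrip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


def looks_like_four_line_fastq(lines) -> bool:
    """Check the header/sequence/plus/quality layout and matching lengths."""
    if not lines or len(lines) % 4 != 0:
        return False
    return all(
        lines[i].startswith("@")
        and lines[i + 2].startswith("+")
        and len(lines[i + 1]) == len(lines[i + 3])
        for i in range(0, len(lines), 4)
    )


# Made-up record with blank lines between fields and a space after the "+".
messy = ["@read1\n", "\n", "ACGT\n", "\n", "+ \n", "\n", "IIII\n"]
cleaned = tidy_fastq_lines(messy)
print(cleaned)                              # ['@read1', 'ACGT', '+', 'IIII']
print(looks_like_four_line_fastq(cleaned))  # True
```

Writing the cleaned lines back with `"\n".join(cleaned) + "\n"` would give Unix line endings; it is safer to write to a new file than to overwrite the original.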

thermokarst · September 11, 2019, 10:30pm

Also, is this representative of before or after using BBEdit to change the line breaks? If after, can you send a head sample of the before?

jhines1 · September 11, 2019, 10:32pm

That was a direct copypasta from running head samplename.fastq. So if that is how q2 is reading it, how would I correct it? What is going on under the hood that makes it read like that?

jhines1 · September 11, 2019, 10:33pm

This was after loading into BBEdit & changing everything to LF breaks.

jhines1 · September 11, 2019, 10:38pm

Sorry, just now saw the other part of the question. Here is the "before" seq.

@LCl-85_M13F_Plate_DNA_00002320_D03.ab1 extraction
TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGGTGGTATTCCACCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAAGCCTGGCTTTGGTGTTGGAGGGATACCTGTAAAAGGGTACCCTCTGAAATTTAGTGGCGGGCTCGCTAGAATTTTGAGCGTAGTAGTTTTACCTCGTTTTTAAAGACTAGTGGGACTTCTTGCCGTAAAACCCCCCAACTTTCTGAAAATTGACCTCGGATCAGGTAGGAATACCCGCTGA
+
OPYY_7<>JOAA<SAHTTKYTS8GGG<.IY_YYTKYKYYJAGGYLHHN\\Y\TML\YTNTTHH?TNLONJT3OLLR9HLJOOSIILCTCCFITT\\\OLCOONLT\JT?SL\TT\LRLYILRLS?LRSCRRYRRT\YTY\_OYRTSQY_YLTSCASLTTRTTLT\TTY\T\Y_TLTS\\_\\\\\Y_WNY9?HL\YLTTRRLILW\YTYSSSTROL_\\YSYTRYIY\R\\\\\\\\W_T\ITTYYY_\_S\RCRRROY\\\RY_\Y\\_\\\WY\___WWLLQSNESW-88RT_RRYYW\_Y\LYYYEWQCQNC=CSOYOWRWYCL\\

thermokarst · September 11, 2019, 10:49pm

Okay, let's scrap whatever you did in BBEdit, that doesn't seem to be helping you at all. So, when you run on the "original files," you mentioned this bit:

What was the extra character you found?

jhines1 · September 11, 2019, 11:03pm

No idea. Everything looked the same, but when running a character count the quality score had one extra character compared to the sequence. I'm not sure if it is a space at the end (was more apparent in BBEdit) or if it was an actual extra character somewhere.
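
One way to pin down an invisible extra character is sketched below. FASTQ quality characters fall in printable ASCII 33 to 126, so a space (32), a tab, or a carriage return stands out immediately. `odd_characters` is a hypothetical helper name, and the inputs are made up rather than taken from the files above.

```python
def odd_characters(quality: str) -> list:
    """Return (position, repr) for characters outside printable ASCII 33-126."""
    return [
        (index, repr(ch))
        for index, ch in enumerate(quality)
        if not 33 <= ord(ch) <= 126
    ]


print(odd_characters("IIII "))   # [(4, "' '")]   trailing space
print(odd_characters("IIII\r"))  # [(4, "'\\r'")] leftover carriage return
print(odd_characters("NNUUXX"))  # []             nothing unusual
```

The quality line should be passed in with only the final LF removed, for example `line.rstrip("\n")`, so that the character being hunted for is not stripped away first.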