This was an issue that arose out of another problem I had, so I figured it would be best to create another ticket for this problem.
As the title states, I have FASTQ files that seem to have an extra character in the quality score line vs. the sequence line. When I try to import my manifest file I get this warning:
There was a problem importing seqs.tsv:
/var/folders/zc/csj0fb595j98l9vn8xybjdr40000gp/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-lp2cavnw/LCl-85_212_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:
Quality score length doesn't match sequence length for record beginning on line 5.
So, I checked the file manually & there was one extra character in the quality score (e.g. 316) vs the sequence (e.g. 315). It seems that this is an issue for some. This person had the same issue due to a joining of FASTA & quality scores with a converter. That got me thinking about the way Windows & Mac (or Unix) encode their line breaks/endings, as has been mentioned to me before.
Using BBEdit, I have gone through every fastq file in the folder holding my sequences & switched the line break types to Mac (CR) & made sure that each seq/fastq file had only four lines (all of them had an extra space that caused the file to have 5 lines, though only four lines had any data/info). I'm thankful I only have ~425 sequences.
Having done all of that, I am still getting this error. I have checked the number of characters in both the quality scores as well as the sequences themselves & they have an identical number of characters. So, I'm really not sure why this is still an issue. Would it help at all to change the line break type to Unix (LF)? I can't imagine that would be the case, but I'm completely lost on this issue.
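Counting characters by eye or in an editor can hide a stray carriage return or trailing space. Here is a minimal Python sketch of the same check done programmatically. The function name `find_length_mismatches` is made up for illustration, and this is not how QIIME 2 validates files; it only assumes plain four-line FASTQ records and splits strictly on LF, so any leftover CR or space counts toward the length.

```python
def find_length_mismatches(fastq_text):
    # Return (first_line_number, seq_len, qual_len) for every four-line
    # record whose sequence and quality lines differ in length.
    # Splitting on LF only means a stray CR or trailing space is counted.
    lines = fastq_text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece left by a final newline
    mismatches = []
    for i in range(0, len(lines) - 3, 4):
        seq, qual = lines[i + 1], lines[i + 3]
        if len(seq) != len(qual):
            mismatches.append((i + 1, len(seq), len(qual)))
    return mismatches


if __name__ == "__main__":
    # Second record has a trailing space on its quality line.
    example = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII \n"
    for line_no, seq_len, qual_len in find_length_mismatches(example):
        print(f"record starting on line {line_no}: seq={seq_len}, qual={qual_len}")
```

Reading an uncompressed file with `open(path, newline="").read()` before passing it in keeps Python from silently translating line endings.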
Yes! Linux software, like QIIME 2, expects Unix (\n) line endings. Hopefully this will immediately solve your issue now that you have removed that extra symbol.
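For anyone who would rather not convert ~425 files by hand in an editor, the conversion to Unix line endings can be sketched in a few lines of Python. The helper names `to_unix_newlines` and `convert_file` are hypothetical, and the sketch assumes uncompressed `.fastq` files small enough to read into memory.

```python
from pathlib import Path


def to_unix_newlines(data: bytes) -> bytes:
    # Replace Windows CRLF first so it does not turn into two LFs,
    # then replace any remaining bare CR (classic Mac) with LF.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path):
    # Rewrite one uncompressed file in place with LF line endings.
    p = Path(path)
    p.write_bytes(to_unix_newlines(p.read_bytes()))
```

Working on bytes rather than text avoids any automatic newline translation by Python itself. Running this on copies of the files, not the originals, is the safer choice.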
I changed all of the line breaks to Unix (LF) format but I am still getting the same error. It's weird because it doesn't throw that error with the first sequence in the file. Rather, it seems to be random. I attempted to remove the files that were called out in each error from the data, but it still has issues with certain files. One error would be mid-way down the file list, then the next error would be near the top of the list... the next near the bottom. I'm not sure what else to do.
Yes, it is the same error. Yesterday, I tried to import and it would throw this same error at a specific file. When I removed the file(s) from the folder it would throw an error on another file. Once I put the file(s) back in the folder it would throw the error on that file again. Having said that, I made a list of files it doesn't like, up to five. The newest error message (after I converted everything to an LF break) is for a file that wasn't on the list yesterday.
/var/folders/zc/csh0fb595j98l9vn8xybjdr40000gp/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-1m3ec9t2/HCO3-13_90_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:
Quality score length doesn't match sequence length for record beginning on line 5
The thing is, I don't have fastq.gz files, I have fastq files. I am trying to import my manifest file with the instructions for "Fastq Manifest Formats" found here.
Yep, the manifest format will gzip any unzipped files for you.
I'm not convinced this has anything to do with line-endings (yet). Can you please provide the first few lines of the file with sample IDs LCl-85 and HCO3? You can run head $FILENAME to get that info. Thanks!
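Plain `head` output does not show invisible characters such as carriage returns or trailing spaces. As a rough complement, here is a Python sketch that prints the first few raw lines with `repr()`, so every hidden byte is visible. The function name `head_repr` is invented for this example, and it assumes the file is either plain text or gzip-compressed with a `.gz` suffix.

```python
import gzip


def head_repr(path, n=8):
    # Return repr() of the first n raw lines of a plain or gzipped file.
    # Binary mode splits on LF only, so CR characters and trailing
    # spaces stay visible in the output (e.g. b'ACGT\r\n').
    opener = gzip.open if str(path).endswith(".gz") else open
    lines = []
    with opener(path, "rb") as handle:
        for _ in range(n):
            line = handle.readline()
            if not line:
                break
            lines.append(repr(line))
    return lines
```

For example, `for line in head_repr("HCO3-13_R1.fastq"): print(line)` would show the first two records, with any `\r` or trailing space spelled out.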
I will post this shortly. I am not able to connect to wifi with my laptop (where I'm using q2) & I fear trying to type everything out will take WAY too much time & is error prone. I will post the results when I get back into wifi range.
However, I can give you the actual header instead of the entire sequence, for now:
Hmm, something still isn't quite right - where are all of these line breaks coming from? Is this really what the results of running head HCO3-13_R1.fastq looked like? Or was it like this:
That was a direct copypasta from running head samplename.fastq. So if that is how q2 is reading it, how would I correct it? What is going on under the hood that makes it read like that?
Okay, let's scrap whatever you did in BBEdit, that doesn't seem to be helping you at all. So, when you run on the "original files," you mentioned this bit:
No idea. Everything looked the same, but when running a character count the quality score had one extra character compared to the sequence. I'm not sure if it is a space at the end (was more apparent in BBEdit) or if it was an actual extra character somewhere.
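One way to tell a trailing space apart from a genuine extra character is to list every character in the quality line that falls outside the printable range FASTQ quality scores use (ASCII 33 `!` through 126 `~`). The sketch below does that; `suspicious_chars` is a hypothetical helper name, not part of any QIIME 2 tooling.

```python
def suspicious_chars(quality_line):
    # Return (index, code_point) for each character that cannot be a
    # FASTQ quality score, i.e. anything outside ASCII 33..126.
    # A space is 32 and a carriage return is 13, so both are flagged.
    return [
        (i, ord(ch))
        for i, ch in enumerate(quality_line)
        if not 33 <= ord(ch) <= 126
    ]
```

An empty result on a line that is still one character too long would suggest the extra character is a real quality symbol rather than whitespace.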