This was an issue that arose out of another problem I had, so I figured it would be best to create another ticket for this problem.
As the title states, I have FASTQ files that seem to have an extra character in the quality score line vs the sequence line. When I try to import my manifest file, I get this error:
There was a problem importing seqs.tsv:
/var/folders/zc/csj0fb595j98l9vn8xybjdr40000gp/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-lp2cavnw/LCl-85_212_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:
Quality score length doesn’t match sequence length for record beginning on line 5.
So, I checked the file manually & the quality score line had one more character than the sequence line (e.g. 316 vs 315). It seems that this is an issue for some. This person had the same issue after joining FASTA & quality scores with a converter. That got me thinking about the way Windows & Mac (or Unix) encode their line breaks/endings, as has been mentioned to me before.
Using BBEdit, I have gone through every fastq file in the folder holding my sequences & switched the line break types to Mac (CR) & made sure that each seq/fastq file had only four lines (all of them had an extra space that caused the file to have 5 lines, though only four lines had any data/info). I’m thankful I only have ~425 sequences.
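For anyone who would rather not click through hundreds of files in an editor, the cleanup described above could be scripted. This is only a rough sketch, assuming plain four-line FASTQ records and assuming that LF line endings are what the importer wants; the function names `normalize_fastq_text` and `normalize_fastq_file` are made up for illustration and are not part of QIIME 2.

```python
import gzip


def normalize_fastq_text(text):
    """Return FASTQ text with LF endings, no trailing spaces/tabs, no blank lines."""
    # Turn Windows (CRLF) and old Mac (CR) line endings into Unix (LF).
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    # A space is not a valid quality character, so stripping it is safe.
    lines = [line.rstrip(" \t") for line in unified.split("\n")]
    # Drop empty lines (assumption: no record has an empty sequence).
    kept = [line for line in lines if line]
    return "\n".join(kept) + "\n" if kept else ""


def normalize_fastq_file(src, dst):
    """Read src (plain or .gz) and write a cleaned, uncompressed copy to dst."""
    opener = gzip.open if str(src).endswith(".gz") else open
    # newline="" keeps the original line endings so we can rewrite them ourselves.
    with opener(src, "rt", newline="") as handle:
        text = handle.read()
    with open(dst, "w", newline="\n") as out:
        out.write(normalize_fastq_text(text))
```

Writing to a separate `dst` path keeps the originals untouched, which seems safer while the cause of the error is still unknown.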
Having done all of that, I am still getting this error. I have checked the number of characters in both the quality scores as well as the sequences themselves & they have an identical number of characters. So, I’m really not sure why this is still an issue. Would it help at all to change the line break type to Unix (LF)? I can’t imagine that would be the case, but I’m completely lost on this issue.
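A quick way to double-check the character counts without counting by hand is a small script like the one below. It is a sketch under a few assumptions: records are exactly four lines each, lines are split on LF only (so a leftover CR or trailing space counts as a character), and the helper name `find_length_mismatches` is hypothetical. It does not reproduce how QIIME 2 itself validates files.

```python
import gzip


def find_length_mismatches(path):
    """Return (first_line_of_record, seq_len, qual_len) for each mismatched record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    # newline="" disables newline translation, so stray "\r" characters stay visible.
    with opener(path, "rt", newline="") as handle:
        lines = handle.read().split("\n")
    # A file ending in "\n" produces one empty trailing element; ignore it.
    if lines and lines[-1] == "":
        lines.pop()
    mismatches = []
    # Walk the file four lines at a time: header, sequence, "+", quality.
    for start in range(0, len(lines) - 3, 4):
        seq, qual = lines[start + 1], lines[start + 3]
        if len(seq) != len(qual):
            # Report 1-based line numbers, like "record beginning on line 5".
            mismatches.append((start + 1, len(seq), len(qual)))
    return mismatches
```

Usage would be something like `print(find_length_mismatches("LCl-85_212_L001_R1_001.fastq.gz"))`. Note that a file with CR-only line endings contains no LF at all, so it would look like one long line here and return an empty list.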
I changed all of the line breaks to Unix (LF) format but I am still getting the same error. It’s weird because it doesn’t throw that error with the first sequence in the file. Rather, it seems to be random. I attempted to remove the files that were called out in each error from the data, but it still has issues with certain files. One error would be mid-way down the file list, then the next error would be near the top of the list…the next near the bottom. I’m not sure what else to do.
Yes, it is the same error. Yesterday, I tried to import and it would throw this same error at a specific file. When I removed the file(s) from the folder it would throw an error on another file. Once I put the file(s) back in the folder it would throw the error on that file again. Having said that, I made a list of files it doesn’t like, up to five. The newest error message (after I converted everything to an LF break) is for a file that wasn’t on the list yesterday.
Yep, the manifest format will gzip any unzipped files for you.
I’m not convinced this has anything to do with line-endings (yet). Can you please provide the first few lines of the file with sample IDs LCl-85 and HCO3? You can run head $FILENAME to get that info. Thanks!
I will post this shortly. I am not able to connect to wifi with my laptop (where I’m using q2) & I fear typing everything out by hand would take WAY too much time & be error-prone. I will post the results when I get back into wifi range.
However, I can give you the actual header instead of the entire sequence, for now:
No idea. Everything looked the same, but when running a character count the quality score had one extra character compared to the sequence. I’m not sure if it is a space at the end (was more apparent in BBEdit) or if it was an actual extra character somewhere.
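To tell a trailing space apart from a real extra character, it may help to look at the raw bytes rather than at what an editor displays. The sketch below is one possible way to do that; `head_repr` and `trailing_invisibles` are hypothetical helper names, and treating ASCII 33 to 126 as the "visible" range is an assumption based on the usual FASTQ quality encoding.

```python
import gzip
from itertools import islice


def head_repr(path, n=8):
    """Return repr() of the first n raw lines, so \\r, spaces and tabs are visible."""
    opener = gzip.open if str(path).endswith(".gz") else open
    # Binary mode: nothing is translated or hidden.
    with opener(path, "rb") as handle:
        return [repr(line) for line in islice(handle, n)]


def trailing_invisibles(line):
    """Return the run of characters at the end of line outside ASCII 33-126."""
    end = len(line)
    while end > 0 and not (33 <= ord(line[end - 1]) <= 126):
        end -= 1
    return line[end:]
```

For example, `print("\n".join(head_repr("some_file.fastq.gz")))` shows each line as a bytes literal, and `repr(trailing_invisibles(quality_line))` shows exactly which hidden characters sit at the end of a quality line.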