Extra quality score characters: 'G'

I have a set of 16S diversity data (part of a time series) that was sequenced by a company that provides the data in 454 format, non-demultiplexed. Demultiplexing with the Qiime1 and Qiime2 methods removes 90% of the reads; however, the company’s binning program produces joined fasta and qual files with plenty of reads, as expected. After converting these files to a joined fastq using the company’s conversion software and then compressing them to .gz for Qiime2, there are issues with importing into Qiime2:

Importing as single-end Phred 33 gives this error:
/var/folders/g9/yct1k13j1dd4tdhynrbw_czxx7sl86/T/q2-SingleLanePerSampleSingleEndFastqDirFmt-2thgjdik/L45_25_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file: Quality score length doesn’t match sequence length for record beginning on line 5

I tried importing as phred 64 in case it was older Illumina software and received this error:
skbio.io._exception.FASTQFormatError: Found more quality score characters than sequence characters. Extra quality score characters: ‘G’
An unexpected error has occurred: Found more quality score characters than sequence characters. Extra quality score characters: ‘G’

Thanks for your help

Hey there @strevat! That error message means that the record beginning on line 5 of L45_25_L001_R1_001.fastq.gz is malformed --- there appear to be more quality scores than nucleotides, like this:

@SEQ_ID
GATTTG
+
!''*((((***+))%%%++)(%%

Here, there are 6 nucleotides but 23 quality scores, nearly four times as many.

That sounds like the culprit --- is it possible that this software is outputting malformed FASTQ? I would check with them and show them your results!
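If it helps to show the company exactly which records are affected, something like the following rough sketch could list them. It is not part of QIIME 2; `find_length_mismatches` is a hypothetical helper name, it uses only the Python standard library, and it assumes strict four-line FASTQ records (no wrapped sequence or quality lines).

```python
import gzip
from itertools import islice


def find_length_mismatches(path, limit=10):
    """Return up to `limit` tuples (line_number, header, seq_len, qual_len)
    for records whose sequence and quality strings differ in length.

    Assumption: every record occupies exactly four lines.
    """
    # Read gzipped or plain-text FASTQ depending on the file extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    mismatches = []
    with opener(path, "rt") as handle:
        line_number = 1  # 1-based line on which the current record starts
        while True:
            record = list(islice(handle, 4))
            if not record:
                break  # end of file
            if len(record) < 4:
                raise ValueError(
                    f"Truncated record beginning on line {line_number}"
                )
            header, seq, _, qual = (line.rstrip("\r\n") for line in record)
            if len(seq) != len(qual):
                mismatches.append((line_number, header, len(seq), len(qual)))
                if len(mismatches) >= limit:
                    break
            line_number += 4
    return mismatches
```

Calling `find_length_mismatches("L45_25_L001_R1_001.fastq.gz")` returns the starting line, header, and the two lengths for the first few bad records, which can be compared against the line number reported in the import error.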

:qiime2: :t_rex: :sunny:
