q2-dada2: invalid q score error

Thank you! The reassurance really helps.

I have run into another problem within dada2 denoising I am hoping you could take a look at. I've read a few forum posts from users with similar problems but haven't found a way around the issue.

This is what I am running -
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs [filepath to paired demux seqs] \
  --p-trunc-len-f 249 \
  --p-trunc-len-r 250 \
  --p-pooling-method pseudo \
  --o-table denoisedtable.qza \
  --o-representative-sequences denoisedrepseqs.qza \
  --o-denoising-stats denoisedstats.qza

It takes a while for this to run, at least 30 min., then gives me this output -

Based on some other threads, I figured you might need to see this as well -
running R -e .libPaths()
running env

I'm not sure what the problem is exactly. Any suggestions?

Hi @Hayley_Guay ,

The key line in your error is here:

Sample 1 has an invalid maximum Phred Quality Scores of 131097

We've seen this error before when quality scores had been arbitrarily edited, or when the sequencing technology was somehow assigning q-scores outside the Phred 33 scheme. Have you seen this thread for some suggestions with a similar problem? If not, have a look there first and let us know if you're still stuck. Thanks!


Thank you @Mehrbod_Estaki, but that thread doesn't seem to help. I want to use dada2 to denoise my paired end sequences, and I know my sequences have a Phred score of 33. Could incorrectly choosing between Phred33 and Phred33V2 import schemes cause this error? Is there a way to manually adjust the Phred score reads at this point to continue analysis? I'm unsure of how to proceed.

Ok, thanks for checking there first.

The V2 (version 2) in the import type you see corresponds to the style of the manifest file and not the Phred scores.

You certainly don't want to manually adjust the quality scores, even if you could. They are crucial for DADA2 to build an error model, which is then used to denoise your reads. Manually changing them would interfere with that process, so the best way forward is to sort this out right from the beginning. Hopefully we can help you there!

A few follow-up questions that will help us diagnose:

  1. What is the sequencing technology used here?
  2. Has there been any other quality control applied to your reads prior to importing them into Q2 for DADA2? Anything performed by you or your sequencing facility that may modify the quality scores?
  3. Can you manually look through some of the quality scores in Sample 1's raw fastq file? We expect to see only Phred 33 ASCII characters (you can paste a few of the lines here if you think they look odd).
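For point 3, eyeballing a large file is tedious, so here is a minimal Python sketch that reports the lowest and highest quality score in a FASTQ file. It is only an illustration, not part of QIIME 2 or DADA2. The function names `open_fastq` and `quality_range` are made up for this example, and it assumes plain four-line FASTQ records (no wrapped sequence lines) and Phred 33 encoding.

```python
import gzip
from pathlib import Path

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def quality_range(path, max_records=10000):
    """Return (min_q, max_q, n_records) over the first max_records reads.

    Assumes each record is exactly four lines, so the quality string
    is every fourth line of the file.
    """
    min_q, max_q, n_records = None, None, 0
    with open_fastq(path) as handle:
        for line_index, line in enumerate(handle):
            if line_index % 4 != 3:
                continue  # not a quality line
            qual = line.rstrip("\r\n")
            if not qual:
                continue  # skip empty quality strings
            scores = [ord(char) - PHRED_OFFSET for char in qual]
            low, high = min(scores), max(scores)
            min_q = low if min_q is None else min(min_q, low)
            max_q = high if max_q is None else max(max_q, high)
            n_records += 1
            if n_records >= max_records:
                break
    return min_q, max_q, n_records
```

Calling `quality_range` on one of your raw files should give scores between 0 and roughly 41 for Illumina data in Phred 33. Negative values, or values far above that, would point to a different encoding or to edited scores.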

Thanks for the follow up.
My sequence data came from GeneWiz NGS.
The only quality control I performed was before sequencing, so the quality scores should be unmodified. I can reach out to GeneWiz to double check this.
When I open my fastq files in a text editor (after unzipping), these are the first few lines - I am pretty sure this only uses the ASCII characters, but here it is anyway -

@GWNJ-1013:261:GW2108291419th.Miseq:1:2101:8983:1047 1:N:0:CTGATGAG+CTTCGCCT
@GWNJ-1013:261:GW2108291419th.Miseq:1:2101:22923:1047 1:N:0:CTGATGAG+CTTCGCCT

This might not fix anything but I remembered that I did use the raw files as imported data, rather than unzipping then importing. Is this an issue? The zipped files straight from GeneWiz don't use the ASCII characters (that is, when I open them in a text editor before unzipping). Does Q2 do that step for me, or should I go back and unzip each file then import?
Thanks again.

Hi @Hayley_Guay,

Q2 can handle and import the gzipped fastq files, and I'm guessing that's the format your data came in from GeneWiz? If so, no issues there.

So it looks like your sequencing center is using Illumina's MiSeq, which these days should always be Phred33-based, so if you imported it as Phred33 you're good there as well.

Your quality scores in that one line also seem correct:

These are all Phred33 ASCII characters. What I do find a bit odd, though, is the actual distribution of these. With typical MiSeq data we often see a much more diverse distribution of quality scores that tend to decrease towards the 3' end of the read. For example:

Do a lot of the reads look like this? Can you check some more lines to see what they look like?
There are two things I'd want to check with your sequencing center: first, that this is in fact MiSeq data, and second, whether any post-sequencing processing was done on their end that could lead to this kind of uniform quality score distribution. With DADA2 you want to start with the raw FASTQ files without any post-processing. Keep us posted.

Thank you again.
Glancing through some of my files, it looks like the majority of the quality strings look similar throughout. Almost all of the quality scores are one of these three characters: ":", "F", or ",", from 3' to 5'.
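A rough way to put a number on "almost all of the quality scores are one of a few characters" is to tally the scores and see how much of the total the most common values cover. The sketch below is only a heuristic under stated assumptions: `quality_histogram` and `looks_binned` are hypothetical helper names, and the thresholds (`max_distinct=4`, `min_fraction=0.99`) are arbitrary starting points rather than any official cutoff.

```python
from collections import Counter


def quality_histogram(quality_strings, offset=33):
    """Count how often each Phred score occurs (assumes Phred+33)."""
    counts = Counter()
    for qual in quality_strings:
        counts.update(ord(char) - offset for char in qual)
    return counts


def looks_binned(counts, max_distinct=4, min_fraction=0.99):
    """Heuristic: True if a handful of score values cover nearly all bases.

    The default thresholds are illustrative assumptions only.
    """
    total = sum(counts.values())
    if total == 0:
        return False
    top = sum(n for _, n in counts.most_common(max_distinct))
    return top / total >= min_fraction
```

Here `quality_strings` is any iterable of quality lines taken from the FASTQ file. A result of `True` suggests the scores have been collapsed into a few bins, which is worth raising with the sequencing center.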
The sequencing center I used has been less than impressive customer-service-wise so I haven't been able to get a clear answer on the MiSeq data - "MiSeq" appears on many lines of my sequences, is this enough to identify? I requested raw FASTQ data originally, so I expect no post-sequencing data processing has been performed. Not sure how to proceed from here.
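On the question of whether "MiSeq" in the read names is enough to identify the platform: Illumina-style FASTQ headers follow the layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`, and splitting a header into those fields shows which field a given label actually sits in. The sketch below is an illustration only; `IlluminaHeader` and `parse_header` are hypothetical names. The fields are strings written by the sequencing software or the facility, so they are best treated as hints and not as proof of the instrument model.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: str
    flowcell_id: str
    lane: str
    tile: str
    x: str
    y: str
    read: str
    is_filtered: str
    control_number: str
    index: str


def parse_header(line):
    """Split an Illumina-style FASTQ header line into named fields."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The read identifier and the read description are separated by a space.
    left, _, right = line[1:].partition(" ")
    left_fields = left.split(":")
    right_fields = right.split(":")
    if len(left_fields) != 7 or len(right_fields) != 4:
        raise ValueError("header does not match the expected layout")
    return IlluminaHeader(*left_fields, *right_fields)
```

For the first header pasted earlier in this thread, `parse_header` puts `GWNJ-1013` in the instrument field and `GW2108291419th.Miseq` in the flowcell field, so the "Miseq" text is part of a facility-assigned label.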

It turns out that my samples were sequenced on a NovaSeq machine, not MiSeq, which should explain the questionable quality scores. I will be proceeding with Deblur instead of DADA2.
Huge thanks to @Mehrbod_Estaki for catching that detail for me!


Thanks for the update @Hayley_Guay - we were having an out-of-band discussion about this post and the idea of these being NovaSeq reads came up there - it sounds like we're all on the same page. Happy QIIMEing!
