DADA2 Fails on AVITI Data

Hi all, back here with another AVITI issue. Our latest AVITI 16S V4 dataset fails at the DADA2 step with the following error:

in dada(drpR, err = errR, multithread = multithread, verbose = FALSE) :
Invalid derep$quals matrix. Quality values must be positive integers.
2: stop("Invalid derep$quals matrix. Quality values must be positive integers.")
1: dada(drpR, err = errR, multithread = multithread, verbose = FALSE)

Has anyone encountered this error? I know a bunch of others have posted here about this error code, but each one has a different cause (1, 2, 3), and I think my cause is different as well.

My guess is that DADA2 was built for Illumina data, where the maximum Q score is 41. AVITI has higher quality scores (up to Q50; ref). This confuses DADA2's automatic quality-score encoding detection, which then misidentifies the data as Phred64 instead of Phred33. To circumvent this, I need to tell DADA2 that the encoding is strictly Phred33 using the argument qualityType="FastqQuality", per this thread.
But if I am not mistaken, QIIME2's implementation of DADA2 does not have an option to allow this argument to be passed. Does anyone have any options other than using DADA2 outside of the QIIME2 environment to get past this?

Is there a way to verify if this indeed is the issue, given that this error has been spotted with different causes?
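
One way to check the hypothesis is to look at the raw quality characters themselves. The following is a minimal sketch, not part of DADA2 or QIIME2: the function names are hypothetical, and it assumes the common four-lines-per-record FASTQ layout. It reports the lowest and highest ASCII codes seen and what they decode to under each offset. If the Phred+64 decoding yields zero or negative scores, that would be consistent with the "must be positive integers" error.

```python
import gzip
import io


def quality_char_range(handle):
    """Return (min_code, max_code) over all quality characters in a FASTQ stream.

    ASSUMPTION: 4 lines per record, with no wrapped sequence or quality lines.
    """
    lo, hi = None, None
    for i, line in enumerate(handle):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        codes = [ord(c) for c in line.rstrip("\r\n")]
        if not codes:
            continue
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    if lo is None:
        raise ValueError("no quality lines found")
    return lo, hi


def describe(lo, hi):
    """Decode the observed ASCII range under both common Phred offsets."""
    return {"phred33": (lo - 33, hi - 33), "phred64": (lo - 64, hi - 64)}


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


if __name__ == "__main__":
    # Tiny in-memory record; replace with open_fastq("your_reads.fastq.gz").
    demo = io.StringIO("@r1\nACGTAC\n+\nIJKLMN\n")
    lo, hi = quality_char_range(demo)
    print(lo, hi, describe(lo, hi))
```

Running this on one of the per-sample files inside the unzipped demux.qza should show whether the maximum score exceeds Q41 and whether any character would fall at or below zero under a Phred+64 reading.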

Alternative causes for this error, and troubleshooting steps are welcome.

Thanks in advance for helping out.

If someone could share what the dada2 pipeline looks like under QIIME2, that would be useful. I would like to follow that closely. I want to make minimal changes and only add the qualityType="FastqQuality" parameter for explicit declaration of Phred33 encoding.

Hello @AviTil,

What q2-dada2 command are you running, and what commands did you run upstream of it? We typically default to setting the phred-offset to 33 internally. If you visit view.qiime2.org and drop in the .qza file containing your demultiplexed sequences you're using as input to dada2, then open the metadata.yaml file on the Data tab, it will show you the phred-offset we set for your data.

That being said, I am not positive that dada2 will respect this.

Hi @Oddant1,

Thanks for the reply. I am running qiime dada2 denoise-paired. Here is a brief explanation of the preceding steps.

  • Multiplexed data was imported as EMP paired-end formatted data, then demultiplexed.
  • The demux.qza was summarized to produce demux.qzv, which shows Q40-Q45 quality scores consistently throughout both forward and reverse reads, as expected with AVITI data.
  • AVITI data is quite large, even in .qza format, so my browser is unable to view the .qza directly on view.qiime2.org. However, I unzipped the .qza and took a peek at the metadata.yaml directly, and it contains the following lines:
      • uuid: 3fece481-0627-44b2-b7ad-78d370bf8ce6
      • type: SampleData[PairedEndSequencesWithQuality]
      • format: SingleLanePerSamplePairedEndFastqDirFmt
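
For large artifacts, the metadata can be read without a browser and without extracting the sequence data. This is a minimal sketch with a hypothetical helper name. It assumes the usual QIIME2 archive layout, a zip file with a single top-level directory named after the UUID. Other small files can be read the same way. I believe the per-format Phred offset is recorded in data/metadata.yml rather than in the top-level metadata.yaml, but treat that as an assumption to verify against your own archive.

```python
import zipfile


def read_qza_member(qza_path, relative_name="metadata.yaml"):
    """Return the text of <uuid>/<relative_name> from a .qza or .qzv archive.

    ASSUMPTION: the archive is a zip file whose top-level directory is the
    artifact UUID, which is the usual QIIME 2 layout.
    """
    with zipfile.ZipFile(qza_path) as archive:
        for name in archive.namelist():
            top, _, rest = name.partition("/")
            if top and rest == relative_name:
                return archive.read(name).decode("utf-8")
    raise FileNotFoundError(
        f"{relative_name} not found under the top-level directory of {qza_path}"
    )


if __name__ == "__main__":
    import sys

    # Usage: python read_qza.py demux.qza
    for path in sys.argv[1:]:
        print(read_qza_member(path))
```

Only the requested member is decompressed, so this stays fast even when the archive holds many gigabytes of reads.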

Now, I don’t think this is a problem with QIIME2 detecting the quality score format. The problem is that the Phred+33 format set in the preceding QIIME2 steps is not being enforced in q2-dada2. I had a peek at the run_dada2.R script from the latest q2-dada2 GitHub repo, and I can confirm that in the numerous places where drpF=derepFastq(filts[[j]]) and drpR=derepFastq(filtsR[[j]]) are called, the Phred offset is not explicitly enforced. The default value of the parameter that determines the Phred offset is Auto, so the program attempts to auto-determine the offset, which is where I think the issues arise. I recall seeing the same issue in some threads/posts about PacBio data, where the recommended solution is to set qualityType="FastqQuality". Both AVITI and PacBio produce high-quality reads, which trigger this issue.
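
To illustrate how auto-detection could go wrong on uniformly high-quality reads, here is a small sketch. It rests on an assumption about the rule: as I read the ShortRead documentation, Auto chooses the Phred+64 encoding when every quality character has an ASCII code above 58 and at least one is above 74. I have not confirmed this against the code path DADA2 actually uses. The function names are hypothetical.

```python
def auto_guess_offset(quality_strings):
    """Guess the Phred offset using the ASSUMED 'Auto' rule.

    Assumed rule: Phred+64 if all ASCII codes are > 58 and at least one
    is > 74; otherwise Phred+33.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    if min(codes) > 58 and max(codes) > 74:
        return 64
    return 33


def decoded_scores(quality_strings, offset):
    """Convert quality characters to integer scores for a given offset."""
    return [[ord(c) - offset for c in q] for q in quality_strings]


if __name__ == "__main__":
    examples = {
        "MiSeq-like": ["IIIIH#55"],   # contains low-quality characters
        "AVITI-like": ["NNMMLK@;"],   # Phred+33 scores from 45 down to 26
    }
    for label, reads in examples.items():
        offset = auto_guess_offset(reads)
        lowest = min(min(s) for s in decoded_scores(reads, offset))
        print(label, "guessed offset:", offset, "lowest decoded score:", lowest)
```

Under this assumed rule, a read set with no base below Q26 and at least one base above Q41 is read as Phred+64, and any character below "A" then decodes to zero or a negative score, which would match the error message.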

I have now fixed this issue, and here is what I did:

  1. Install a new conda environment with QIIME2 (as of writing this, we are using v2024.10) independently of the one installed on the High Performance Compute Cluster that we use.

  2. Navigate to the /.../.conda/<env>/bin folder (<env> is the environment name). Find run_dada2.R. Find all instances of the drpF=derepFastq(filts[[j]]) and drpR=derepFastq(filtsR[[j]]) commands in the script. Add the parameter qualityType="FastqQuality" to these commands.
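
The manual edit in step 2 could also be scripted. This is a rough sketch, not an official q2-dada2 tool: the function name is hypothetical, and the regular expression only handles calls whose arguments contain no nested parentheses (square brackets such as filts[[j]] are fine). Calls that already set qualityType are left alone. Review the result by hand, since the layout of run_dada2.R may differ between versions.

```python
import re


def add_quality_type(r_source, functions=("derepFastq",), value="FastqQuality"):
    """Append qualityType="<value>" to matching R calls in a source string.

    ASSUMPTION: target calls look like name(args) with no parentheses
    inside args; anything more complex is skipped, not rewritten.
    """
    names = "|".join(re.escape(name) for name in functions)
    pattern = re.compile(r"\b(" + names + r")\(([^()]*)\)")

    def patch(match):
        name, args = match.group(1), match.group(2)
        if "qualityType" in args:
            return match.group(0)  # already explicit, leave unchanged
        separator = ", " if args.strip() else ""
        return f'{name}({args}{separator}qualityType="{value}")'

    return pattern.sub(patch, r_source)


if __name__ == "__main__":
    original = (
        "drpF <- derepFastq(filts[[j]])\n"
        "drpR <- derepFastq(filtsR[[j]])\n"
    )
    print(add_quality_type(original))
```

To apply it to a file, read run_dada2.R into a string, keep a backup copy, and write the patched string back. Passing functions=("derepFastq", "learnErrors") would extend the same change to learnErrors() calls.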

I think it would be worthwhile for the next QIIME2 version to pass the Phred offset through to dada2 explicitly. However, I wanted to update this post in case others find it useful in the meantime.

This raises another question for me - the error was thrown by DADA2 in the denoising step, and was fixed by explicitly enforcing the Phred offset. However, are there other steps within DADA2 preceding denoising that could have the same issue but do not throw an explicit error? One example that I am looking at is the learnErrors() function, which also accepts a parameter for an explicit Phred offset, but it is left at Auto. I am worried that an error model learned without this offset might be incorrect, and that model is then applied to the full dataset. Also, let me know if any other step would benefit from this explicit Phred offset declaration, as I have no idea about the under-the-hood workings of dada2.

Excellent work debugging that. I am also not familiar with the inner workings of DADA2. I would probably err on the side of setting this manually everywhere.

I have created an issue referencing this on the q2-dada2 GitHub page here. I'm not sure why we don't already expose this functionality; there may be some underlying technical reason, or it may just be that auto has always worked well enough that this was never a priority.

This.

AVITI data has been an issue on the R package side of things as well. Even there we don’t have full guidance for AVITI users, in part because we currently lack good test data (for example, AVITI-sequenced mock community samples).

Hi @benjjneb & @Oddant1,

Thanks for the confirmation. I have analyzed my data with the Phred offset explicitly set in both the learnErrors() and derepFastq() functions. I am not sure if there are other places in the script where this would need to be explicitly declared. But I think this is good enough for now, and I will keep an eye out for updates and re-analyze if needed.

We have some actual data whose libraries were sequenced concurrently on both Illumina MiSeq and AVITI. We can also include some data from a few replicates of the ZymoBIOMICS mock community. Let me know if these would be of interest for creating an AVITI-specific guide. If so, I can try to get approval from my collaborators to see if we can supply those.

Thanks for your assistance here!

Vsearch tests on the fastq-test suite: PMC2847217

Some example fastq files that show the quality scores in question would be helpful!

That's ideal!
