Error rates could not be estimated (Novaseq dataset)

ju4n_dc · November 25, 2021, 5:02am

Hi

I'm getting some weird errors when running qiime dada2 denoise-paired pipeline on a Novaseq dataset. As mentioned in this post, quality plots display a very rare pattern with quantized values, and the same is evident in the fastq file quality lines (all the quality scores read "F", ":" or ",").

About my dataset:

16S V3-V4 amplicon
250 b paired-end reads (Novaseq platform)
38 gut microbiota samples
23 M reads (600k per sample) dataset

Here I copy my command and the resulting errors:

qiime dada2 denoise-paired \
> --i-demultiplexed-seqs i04-demux.qza \
> --p-trunc-len-f 0 \
> --p-trunc-len-r 0 \
> --p-trim-left-f 0 \
> --p-trim-left-r 0 \
> --p-n-threads 6 \
> --o-table i04-table-dada2.qza \
> --o-representative-sequences i04-rep-seqs-dada2.qza \
> --o-denoising-stats i04-stats-dada2.qza \
> --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmp7lorj9n3/forward /tmp/tmp7lorj9n3/reverse /tmp/tmp7lorj9n3/output.tsv.biom /tmp/tmp7lorj9n3/track.tsv /tmp/tmp7lorj9n3/filt_f /tmp/tmp7lorj9n3/filt_r 0 0 0 0 2.0 2.0 2 12 independent consensus 1.0 6 1000000

R version 4.0.3 (2020-10-10) 
Loading required package: Rcpp
DADA2: 1.18.0 / Rcpp: 1.0.6 / RcppParallel: 5.1.2 
1) Filtering ......................................
2) Learning Error Rates
301902067 total bases in 1204373 reads from 2 samples will be used for learning the error rates.
^[	301166572 total bases in 1204373 reads from 2 samples will be used for learning the error rates.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.
Execution halted
Traceback (most recent call last):
  File "/home/jdc/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 266, in denoise_paired
    run_commands([cmd])
  File "/home/jdc/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
    subprocess.run(cmd, check=True)
  File "/home/jdc/miniconda3/envs/qiime2-2021.4/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmp7lorj9n3/forward', '/tmp/tmp7lorj9n3/reverse', '/tmp/tmp7lorj9n3/output.tsv.biom', '/tmp/tmp7lorj9n3/track.tsv', '/tmp/tmp7lorj9n3/filt_f', '/tmp/tmp7lorj9n3/filt_r', '0', '0', '0', '0', '2.0', '2.0', '2', '12', 'independent', 'consensus', '1.0', '6', '1000000']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jdc/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-514>", line 2, in denoise_paired
  File "/home/jdc/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 244, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/jdc/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/jdc/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 279, in denoise_paired
    raise Exception("An error was encountered while running DADA2"
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Plugin error from dada2:

  An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

See above for debug info.

I suspect DADA2 cannot handle this kind of "flaty quality" datasets. Any suggestions on how to proceed with the analysis would be helpful.

Perhaps it is convenient to use deblur in this dataset? I always used DADA2 in my previous analyses (data sequenced in MiSeq).

Are the ASVs obtained with DADA2 and deblur comparable? I would like to be able to do a longitudinal analysis of the microbiota by combining the previous Miseq data with this Novaseq dataset.

Attached find the demux.qzv visualization
i04-demux.qzv (307.0 KB)

Thank you very much in advance
Juan

Mehrbod_Estaki · November 26, 2021, 11:09pm

Hi @ju4n_dc,
This is a great question and one we are seeing more often on the forum, and I suspect we'll see more moving forward as NovaSeq and the binning quality scores become more common.

The q2-dada2 is just a wrapper of DADA2 and so will face the same issues as the native version. Using DADA2 for NovaSeq data is doable with some modification (see this discussion here) but that will require you to either use it natively in R, or create your custom branch of q2-dada2 with modifications mentioned in that post. That being said, the developer of DADA2 mentions in that thread that while things look ok, DADA2 has not been thoroughly tested with NovaSeq data, so just be mindful of that if you choose to go that way.
As for using Deblur instead, that is a bit different because Deblur actually uses a pre-trained model for denoising so it technically will not care for the quality scores at all. So, technically you will have no problem running this data through Deblur. That being said, Deblur's error model was based on the MiSeq and HiSeq machines and I'm not aware of any benchmarking of its use with NovaSeq (or other) sequencers. Given that the NovaSeq technology is claimed to have less error rates than MiSeq and HiSeq, I would speculate that applying the more conservative MiSeq/HiSeq error model is ok here, though a more customize model might result in higher # of reads.

The results are comparable , however, using the same bioinformatics pipeline would be more ideal and I would personally recommend that. I'll also add that as read lengths get longer (beyond 100-150 nts) DADA2 tends to retain more reads than Deblur due to the nature of the denoising algorithms (relevant explanation here). So consider sequencing depth as an important factor as well if you are going to mix and match denoisers.

I'd love to hear what others think on this topic, something we'll be facing more on the forum soon I'm sure.

ju4n_dc · November 29, 2021, 2:00pm

Hi Mehrbod,

Thanks for your reply, the links were really helpful. I understand then that there's no easy way to solve this error within QIIME2, lets say, by changing denoising parameters in the QIIME2 command. Unfortunately I don't manage to use R for now, so I can't run DADA2 natively.

I think I'm going to give it a try using deblur with my MiSeq and Novaseq sequences together. From what I have read there may be minor differences, but the overall biological result should not be altered.

I still don't understand why Illumina decided to change the output from their new platforms, as disaggregated quality on a base-by-base basis is such a valuable parameter.

Anyway, hope to be able to use DADA2 again in the near future,

Mehrbod_Estaki · November 29, 2021, 6:00pm

Illumina explains their new binning system and why they changed it here. In short, they claim it is a simplified version without losing sensitivity and more importantly it saves computation time and storage space. This is a big issue as sequencing depth becomes exponentially deeper with every new generation of sequencers. The software development to optimally utilize these scores will catch up shortly I'm sure.

colinbrislawn · December 2, 2021, 1:40am

Thank you for bringing this to my attention, Mehrbod!

With the change in Q score, did they keep the fastq format the same (and just use 4 values for Q), or are they incodeing fewer letters in the quality line of the fastq file?

ju4n_dc · December 2, 2021, 3:52pm

Sorry, I expressed myself wrong. Only four Q values are used but there is still one Q-value for each base in the fastq file. Still I find it a step back losing detailed non-binned quality scores.

Mehrbod_Estaki · December 2, 2021, 7:42pm

Hi @ju4n_dc,
Having had a second look at your error I wanted to clarify a couple of additional things.
Your original error message:

This suggests that you don't have enough reads for DADA2 to build a reliable error model.

But this line suggests that DADA2 did in fact have access to enough reads as by default it only requires 1 million reads. So the fact that it wasn't able to move pass the error-building step is what makes me think the quality scores are not playing nicely with the model building. But this is speculation on my part! @timanix also raised the possibility of running out of memory because NovaSeq datasets are massive, so certainly something to consider even if you sort out the problem. My hunch is that at that step it is not a memory issue (yet) but something to consider later on.

The FASTQ format looks like the same with just 4 characters in the quality lines, but, I'm not sure what this looks like at the raw data in the sequencer itself. I have to think they have a different format in saving the raw binary files in the sequencer which they then convert to regular FASTQ. Otherwise, I don't really see where the "space saving" aspect comes in to play