DADA2 for Single-end sequences from .FASTQ

Hello,

I work with 16S samples from the LifeLines-DEEP cohort and use the q2-dada2 plugin for my analysis.

I successfully imported my data into QIIME 2 using a manifest file:

!qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path single-end-demux.qza

and the quality plot looks as follows:

Now, I'm trying to do DADA2 analysis:
!qiime dada2 denoise-single \
  --i-demultiplexed-seqs test_single-end-demux.qza \
  --p-trim-left 17 \
  --p-trunc-len 245 \
  --p-max-ee 2 \
  --p-trunc-q 2 \
  --p-n-threads 0 \
  --output-dir LL_DADA2_denoising_output \
  --verbose

This gives me an error at the step of learning error rates:

  1. Learning Error Rates
    149883798 total bases in 592718 reads from 8 samples will be used for learning the error rates.
    Error rates could not be estimated (this is usually because of very few reads).
    Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.
    Execution halted
    ...
    Plugin error from dada2:
    An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

From the error above I understood that the parameters I had chosen somehow filter out all my reads, but when I relax those parameters as much as possible, I still get the same error:
!qiime dada2 denoise-single \
  --i-demultiplexed-seqs test_single-end-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 0 \
  --p-max-ee 2 \
  --p-trunc-q 0 \
  --p-n-threads 0 \
  --output-dir LL_DADA2_denoising_output \
  --verbose

Unfortunately, I have no domain knowledge about this data and I'm not 100% sure how to pick those parameters, but the little experiment I described above seemed like a reasonable check.

Could you please explain what I might be doing wrong?

Hi @Oleg ,

At first glance, I don't think you are doing anything wrong, and I don't think that there is anything wrong with the parameters that you are setting, per se. I think it is an issue with the original dataset.

Based on the quality profile, it would appear that all bases have the same Q score, so dada2 cannot build an appropriate error matrix, hence the "Error matrix is NULL" failure at the error-rate learning step.

I recommend inspecting the raw fastqs to see what the Q scores look like — could you post the first few lines here?
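
If it helps, here is a minimal Python sketch for tallying the Q scores in a FASTQ file rather than eyeballing them. It assumes Phred+33 encoding (as in the import format used above) and unwrapped four-line FASTQ records. `quality_score_counts` is a hypothetical helper name, and the records below are toy data, not the real reads.

```python
from collections import Counter
from itertools import islice


def quality_score_counts(fastq_lines, offset=33):
    """Count how often each Phred score occurs on the quality lines.

    Assumes unwrapped 4-line FASTQ records, so every 4th line holds qualities.
    """
    counts = Counter()
    # Lines 4, 8, 12, ... (indices 3, 7, 11, ...) are the quality strings.
    for qual in islice(fastq_lines, 3, None, 4):
        counts.update(ord(ch) - offset for ch in qual.rstrip("\n"))
    return counts


# Toy records for illustration only (not the real data).
example = [
    "@read_1", "TACG", "+", "HHHH",
    "@read_2", "TACG", "+", "HHHH",
]
counts = quality_score_counts(example)
print(counts)
if len(counts) == 1:
    print("Every base has the same Q score; these are probably placeholder values.")
```

To run it on a real file, pass an open text handle instead of the list, for example `open(path)` or `gzip.open(path, "rt")` for a compressed file, where `path` is whatever your FASTQ is called.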

Hi, @Nicholas_Bokulich

Thank you for your response. Here are the first two records of one .fastq file (forward reads only):
@some_ID
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCGCAGCAAGTCTGATGTGAAAGGCAGGGGCTTAACCCCTGGACTGCATTGGAAACTGCTGTGCTTGAGTGCCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAGCACCAGTGGCGAAGGCGGCTTACTGGACGGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@some_ID
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGTGGCAAGTCTGATGTGAAAGGCATGGGCTCAACCTGTGGACTGCATTGGAAACTGTCATACTTGAGTGCCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAGCACCAGTGGCGAAGGCGGCTTACTGGACGGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

Yep, as I suspected: the Q scores are all uniform (presumably in all sequences, not just the first few).

This is impossible for real sequencing output. It indicates that the authors of the study most likely created artificial Q scores for the reads, possibly because the original quality scores were lost. You could contact them for the true Q scores or an explanation.

So these data will not work with dada2. You could use deblur (which does not use Q scores) for denoising.

Good luck!

I see, so the "HHH..." after the plus sign is what gives it away; normally it would be a more varied combination of protein bases. Thank you very much!
I will try to reach out to the data owners or follow your advice about deblur.

All the best,
Oleg

Those are PHRED quality scores, not protein bases, but yes, normally there would be different characters, indicating different quality scores.
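
As a rough illustration of how those characters map to scores, here is a small Python sketch. It assumes the Phred+33 encoding named in the import format above; the function names are hypothetical, and the second quality string is made up for contrast.

```python
def phred33_to_scores(quality_string):
    """Convert a Phred+33 quality string to a list of integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# 'H' is ASCII 72, so every 'H' decodes to Q39.
print(phred33_to_scores("HHH"))
# A made-up string with mixed characters decodes to varying scores.
print(phred33_to_scores("I5#"))
print(error_probability(20))
```

A real run normally shows a mix of scores along each read, which is the variation dada2 relies on when it learns error rates.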

Thank you, Nicholas!
I've managed to get the quality scores for this data, and the quality plot now looks as follows:

And now when I run dada2, there is a new error:

There were some problems with the command:
(1/3?) no such option: --p-trim-left-f Did you mean --p-trim-left?
(2/3?) no such option: --p-trunc-len-f Did you mean --p-trunc-len?
(3/3?) no such option: --p-max-ee-f Did you mean --p-max-ee?

I've seen a similar problem on the forum, but it looks like the poster simply mixed up the commands
qiime dada2 denoise-paired
and
qiime dada2 denoise-single

In my case, I have only forward reads and I use the correct command, yet it still gives me the error above.
Here is my code:
qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demux.qza \
  --p-trim-left-f 18 \
  --p-trunc-len-f 158 \
  --p-max-ee-f 2 \
  --p-trunc-q 2 \
  --p-n-threads 0 \
  --output-dir DADA2_denoising_output \
  --verbose

Do you have any idea about that?

Thank you for your help, really appreciate it!

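Hi @Oleg, it is the same problem as the one you linked to above: you're mixing up parameters from two different commands. The hints in the error message tell you how to fix this (if you mean to use denoise-single): replace --p-trim-left-f, --p-trunc-len-f, and --p-max-ee-f with --p-trim-left, --p-trunc-len, and --p-max-ee.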
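
Since you seem to be working from a notebook, the corrected call could be sketched in Python as below. This is only an illustration: `denoise_single_command` is a hypothetical helper, the flag names are the ones from your earlier denoise-single commands and the error hints, and the values are copied from your last post.

```python
import shlex


def denoise_single_command(demux_qza, output_dir, trim_left, trunc_len,
                           max_ee=2, trunc_q=2, n_threads=0):
    """Build a denoise-single argument list using only single-end flag names."""
    return [
        "qiime", "dada2", "denoise-single",
        "--i-demultiplexed-seqs", demux_qza,
        "--p-trim-left", str(trim_left),    # not --p-trim-left-f
        "--p-trunc-len", str(trunc_len),    # not --p-trunc-len-f
        "--p-max-ee", str(max_ee),          # not --p-max-ee-f
        "--p-trunc-q", str(trunc_q),
        "--p-n-threads", str(n_threads),
        "--output-dir", output_dir,
        "--verbose",
    ]


cmd = denoise_single_command(
    "single-end-demux.qza", "DADA2_denoising_output",
    trim_left=18, trunc_len=158,
)
# Print the command as a single shell-ready line.
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.run(cmd, check=True)` in an environment where QIIME 2 is installed.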

Let us know how it goes!