Reasons for PacBio read loss during denoising

Hi all,

Running DADA2 on PacBio CCS reads, I am seeing most of my reads being filtered out (~97%), but only for some samples. These samples all correspond to one specific environment, so that may already be a clue. This is full-length 16S.

The quality of the data looks fine for all samples, and I do not see a difference between the samples that fail and those that are okay (using either FastQC or qiime demux summarize). That said, I am not sure whether the quality scores of the CCS reads are actually used by, or compatible with, the denoising algorithm in QIIME 2.
[image: quality plot]

I have tried using cutadapt to select reads with the primers and trim them before denoising with the "single" algorithm as well as using the "ccs" algorithm directly. Both give similar results. I am not truncating at all (--p-trunc-len 0), using the --p-max-ee option only (tried 1, 2 and 4 without much impact).

Could someone help me understand the reasons behind the read loss during both filtering and denoising?

Thanks!


Good afternoon @Oxalis,

Welcome to the forums!

The scores in that quality plot do look suspiciously high: Q40 means 99.99% per-base accuracy. As suggested here, I wonder if the wrong Q score / Phred offset was chosen during import.

What Phred number did you use? Have you tried others?

Hi and thanks for your answer Colin!

I have used SingleEndFastqManifestPhred33V2 and it has worked fine for other experiments.
Using SingleEndFastqManifestPhred64V2 triggers the error:

Decoded Phred score is out of range [0, 62].


Thanks for checking. This implies that 'SingleEndFastqManifestPhred33V2' is correct.
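
For anyone who wants to double-check the offset outside of QIIME 2, here is a minimal Python sketch. It is not how the importer works internally; it only relies on the fact that Phred64 encodes Q0 as '@' (ASCII 64), so any lower character in a quality string can only come from a Phred33 file. The function names and the example strings are made up for illustration.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def guess_phred_offset(quality_strings):
    """Return 33 if any character rules out Phred64, else None (ambiguous)."""
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    # Phred64 encodes Q0 as '@' (ASCII 64), so any lower character
    # can only come from a Phred33-encoded file.
    return 33 if lowest < 64 else None


# Hypothetical quality strings, not taken from the poster's data.
example = ["~~~~I5~~", "~~+~~~~~"]
print(guess_phred_offset(example))
print(decode_quality("I5~", offset=33))
```

If every character is '@' or higher, the sketch returns None rather than guessing, because very high-quality Phred33 reads can look like that too.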

Part of the inconsistency could come from not using --p-trunc-len: if max-ee is constant, longer sequences accumulate more expected errors and are more likely to be removed. Truncating to a common length should even out read loss during filtering, and perhaps improve denoising.

How much do you expect the length of your 16S amplicon to vary in these samples?

The sequences will vary from 1.2 to 1.8 kb. I thought --p-trunc-len 0 would mean that all reads are kept regardless of length? I also tried increasing max-ee (1, 2, 4 and even 10), but the impact is relatively small...

Ah, this makes sense.

Expected error was introduced by Robert Edgar (MUSCLE, USEARCH, UPARSE) because he noticed that Illumina errors were bimodal: most reads were much better than the advertised 99.9% accuracy (Q30), but each run still contained lots of low-quality reads that were full of errors.

Instead of trimming off the low-quality ends of all reads, it worked better to drop the handful of reads that were bad throughout.

This was on the Illumina platform.
An expected error of 2 on a 250 bp region (like 16S V4) means about 248/250 = 99.2% identical.
On a 1.8 kb read, that same identity corresponds to an expected error of about 14 (1786/1800 = 99.22%).
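
To make the arithmetic above concrete, here is a small Python sketch of the usual expected-error definition: the sum of per-base error probabilities, each equal to 10^(-Q/10). It is an illustration only, not DADA2's code; the function names, the uniform Q30 reads, and the idea of scaling max-ee linearly with read length are assumptions made for the example.

```python
def error_probability(q):
    """Per-base error probability implied by a Phred score."""
    return 10 ** (-q / 10)


def expected_errors(phred_scores):
    """Expected number of errors in a read: sum of per-base error probabilities."""
    return sum(error_probability(q) for q in phred_scores)


def scaled_max_ee(max_ee, reference_length, read_length):
    """Scale a max-ee threshold so the allowed errors per base stay the same."""
    return max_ee * read_length / reference_length


# Hypothetical reads with uniform Q30, differing only in length.
short_read = [30] * 250
long_read = [30] * 1800

print(expected_errors(short_read))   # about 0.25
print(expected_errors(long_read))    # about 1.8
print(scaled_max_ee(2, 250, 1800))   # threshold matching max-ee 2 at 250 bp
```

At identical per-base quality, the 1.8 kb read carries about seven times the expected errors of the 250 bp read, which is why a fixed max-ee is much stricter on full-length 16S than on V4.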

PacBio is pretty different from Illumina, so I'm not even sure it makes sense to treat the Q scores the same way. Perhaps a different filtering or denoising method is needed.

The DADA2 developer is on the forums. Perhaps we could ask them!
