Running DADA2 on PacBio CCS reads (full-length 16S), I am seeing most of my reads (~97%) filtered out, but only for some samples. These samples all come from one specific environment, so that may already be a clue...
The quality of the data looks fine for all samples, and I do not see a difference between the samples that fail and those that are okay (using either fastqc or qiime demux summarize). That said, I am not sure whether the quality scores of the CCS reads are actually used by, or compatible with, the denoising algorithm in qiime2.
I have tried using cutadapt to select reads with the primers and trim them before denoising with the "single" algorithm as well as using the "ccs" algorithm directly. Both give similar results. I am not truncating at all (--p-trunc-len 0), using the --p-max-ee option only (tried 1, 2 and 4 without much impact).
That quality plot does look suspiciously high: Q40 means 99.99% per-base accuracy. As suggested here, I wonder if the wrong Q score / Phred offset was chosen during import.
Which Phred offset (33 or 64) did you use? Have you tried the other one?
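For anyone who wants to sanity-check this outside of QIIME 2, here is a minimal Python sketch, not anything taken from the QIIME 2 or DADA2 code. It converts a Phred score to per-base accuracy and applies a rough heuristic to the quality lines of a FASTQ file: any character below ASCII 59 can only occur with a Phred+33 encoding. The function names and the "return None when unsure" behaviour are my own assumptions for illustration.

```python
def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred score: 1 - 10^(-Q/10)."""
    return 1.0 - 10 ** (-q / 10)


def guess_phred_offset(quality_lines):
    """Rough heuristic (an assumption, not a QIIME 2 function).

    Characters below ASCII 59 cannot occur in a 64-based encoding,
    so seeing one means the file must be Phred+33. If every character
    is 59 or above, the data could be 64-based OR very high quality
    Phred+33 data, so we return None instead of guessing.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line.strip())
    if lowest < 59:
        return 33
    return None


print(phred_to_accuracy(40))                  # about 0.9999
print(guess_phred_offset(["!!~~", "5555"]))   # 33
print(guess_phred_offset(["~~~~"]))           # None (ambiguous)
```

Note that very high quality reads can land in the ambiguous case, since all of their quality characters may sit in the upper ASCII range, so this check can confirm Phred+33 but cannot rule it out.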
Thanks for checking. This implies that 'SingleEndFastqManifestPhred33V2' is correct.
Part of the inconsistency could come from not using --p-trunc-len: if max-ee is constant, longer sequences accumulate more expected errors and are more likely to be removed. Truncating to a common length should normalize read loss during filtering, and perhaps improve denoising.
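To make the length effect concrete, here is a small Python sketch. It assumes the usual definition of expected errors (the sum of per-base error probabilities, 10^(-Q/10)) and assumes a read is discarded when that sum exceeds the max-ee threshold. The uniform Q30 reads are made up for illustration, and the function names are hypothetical.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, p = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_max_ee(phred_scores, max_ee):
    """Assumed filter rule: keep the read if its expected errors <= max_ee."""
    return expected_errors(phred_scores) <= max_ee


# Two made-up reads with identical per-base quality (Q30) but different lengths.
short_read = [30] * 250
long_read = [30] * 1500

print(expected_errors(short_read))   # about 0.25
print(expected_errors(long_read))    # about 1.5
print(passes_max_ee(short_read, 1), passes_max_ee(long_read, 1))
```

With the same per-base quality, the longer read has six times the expected errors, so a threshold that keeps the short read drops the long one.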
How much do you expect the length of your 16S amplicon to vary in these samples?
The sequences will vary from 1.2 to 1.8 kb. I thought --p-trunc-len 0 would mean keep all reads no matter the length? I also tried increasing max-ee (1, 2, 4 and even 10), but the impact is relatively low...
Expected Error was introduced by Robert Edgar (muscle, usearch, uparse) because he noticed that Illumina errors were bi-modal; most reads were much better than the 99.9% advertised accuracy (Q30), but the overall run had lots of low-quality reads that were full of errors.
Instead of trimming off the low-quality ends of all reads, it worked better to drop the handful of reads that were bad throughout.
This was on the Illumina platform.
An expected error of 2 on a 250 bp region (like 16S V4) means 248/250 = 99.2% identical.
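An expected error of 14 on 1.8 kb works out to about the same thing: 1786/1800 ≈ 99.22% identical. So to be equally strict on full-length reads, max-ee would have to scale up with read length.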
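The arithmetic above can be written as a short Python sketch. It treats expected errors as if they were actual mismatches, which is only an approximation, and both helper names are hypothetical.

```python
def implied_identity(max_ee, length):
    """Fraction of bases expected to be correct if a read has max_ee errors."""
    return 1.0 - max_ee / length


def equivalent_max_ee(length, identity):
    """max-ee that corresponds to a target identity at a given read length."""
    return length * (1.0 - identity)


print(round(implied_identity(2, 250), 4))       # 0.992
print(round(implied_identity(14, 1800), 4))     # 0.9922
print(round(equivalent_max_ee(1800, 0.992), 1)) # 14.4
```

Read this way, a max-ee of 1 to 4 on 1.2 to 1.8 kb reads is far stricter per base than a max-ee of 2 on a 250 bp Illumina amplicon.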
PacBio is pretty different from Illumina, so I'm not even sure it makes sense to treat the Q-scores the same way. Perhaps a different filtering or denoising method is needed.
The dada2 developer is on the forums. Perhaps we could ask them!