DADA2 filtering more than 60% of reads for PacBio SII data

Hi
dada2-ccs_stats2.qzv (1.2 MB)
samples.demux.summary.qzv (354.6 KB)

I am analyzing full-length 16S rRNA data with QIIME 2. After the denoising step, the majority of the sequences (~60%) are filtered out and I am left with only a small fraction. I ran the following command. Can anyone help me understand what is going wrong here? I am also attaching my results and the sequencing QC data.

qiime dada2 denoise-ccs \
  --i-demultiplexed-seqs ./samples.qza \
  --o-table dada2-ccs_table.qza \
  --o-representative-sequences dada2-ccs_rep.qza \
  --o-denoising-stats dada2-ccs_stats.qza \
  --p-min-len 1000 --p-max-len 1600 \
  --p-max-ee 2 \
  --p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT \
  --p-n-threads 8

Hello Hunda,

Thank you for posting both your quality score plot and your DADA2 stats. I think I understand what's going on.

For PacBio CCS reads, the quality is pretty good!

But because these reads are much longer than the Illumina reads most people work with, the maximum expected error filter is still removing lots of them.

--p-max-ee 2

This setting removes any read with more than two expected errors, where the expected error is the sum of the per-base error probabilities implied by the quality scores. Because your reads are about 10x longer than Illumina reads, we would expect about 10x as many errors in them at the exact same per-base quality.
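
To make the arithmetic concrete, here is a minimal sketch of the expected-error calculation, assuming the standard Phred relationship p = 10^(-Q/10). The read lengths and uniform quality scores are made-up illustration values, not taken from your data, and the function name is hypothetical.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, with p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)

# Hypothetical reads with the same uniform per-base quality (Q30).
short_read = [30] * 150    # Illumina-like length
long_read = [30] * 1500    # full-length 16S, PacBio-like length

ee_short = expected_errors(short_read)   # about 0.15
ee_long = expected_errors(long_read)     # about 1.5, ten times higher

# At a slightly lower uniform quality (Q27), only the long read
# exceeds a max-ee threshold of 2.
ee_short_q27 = expected_errors([27] * 150)    # about 0.3
ee_long_q27 = expected_errors([27] * 1500)    # about 3.0

print(ee_short, ee_long, ee_short_q27, ee_long_q27)
```

So two reads of identical per-base quality can land on opposite sides of a fixed threshold purely because of their length.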

A fixed expected-error threshold is biased in favor of short, high-accuracy reads, which is what Illumina produces and PacBio does not.

With that in mind, I would try raising --p-max-ee to a few different thresholds, say 5, 10, and 20, and see how many of your reads pass the filter. This is similar in spirit to using the --fastq_maxee_rate option in VSEARCH, which filters on expected errors per base and so is not biased against longer reads.
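
As a rough illustration of the difference between the two kinds of filter, the sketch below compares an absolute expected-error cutoff with a per-base rate cutoff. The helper names, the example reads, and the 0.005 rate are all hypothetical values chosen for illustration; this is not how DADA2 or VSEARCH implement their filters internally.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, with p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)

def passes_max_ee(phred_scores, max_ee):
    """Absolute cutoff: keep the read if its total expected error <= max_ee."""
    return expected_errors(phred_scores) <= max_ee

def passes_max_ee_rate(phred_scores, max_rate):
    """Per-base cutoff: keep the read if expected errors per base <= max_rate."""
    return expected_errors(phred_scores) / len(phred_scores) <= max_rate

# Hypothetical reads with identical per-base quality (Q25) but different lengths.
reads = {"short_q25": [25] * 250, "long_q25": [25] * 1500}

# Absolute thresholds: the long read only passes once max_ee is raised.
abs_results = {
    max_ee: {name: passes_max_ee(q, max_ee) for name, q in reads.items()}
    for max_ee in (2, 5, 10, 20)
}

# Per-base threshold: both reads are treated the same regardless of length.
rate_results = {name: passes_max_ee_rate(q, 0.005) for name, q in reads.items()}

print(abs_results)
print(rate_results)
```

With the absolute cutoff at 2, only the short read survives; from 5 upward both pass. The per-base cutoff keeps both from the start, since they have the same per-base quality.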
