I am analysing PacBio Sequel II full-length 16S rRNA CCS reads (~1450 bp) using the DADA2 long-read workflow and observing an unusually high number of unique sequences.
During dereplication with derepFastq(), almost all reads appear unique (e.g., ~11,200 unique sequences from ~12,300 total reads). After denoising, only a small number of reads remain.
Is such a high unique/read ratio normal for PacBio full-length 16S CCS data? Could this be related to sequence orientation, primer trimming, or filtering parameters?
Any suggestions for diagnosing or resolving this issue would be appreciated.
Hello @Muhammad_Ali,
Welcome to the forums! 
This is a great question, as 90% of the reads being unique is unexpected for an Illumina 16S study. But is it normal for PacBio reads?
I found this repo, from when DADA2 was first being adapted for use with long, noisy data: benjjneb/LRASManuscript on GitHub (Reproducible Analyses accompanying DADA2 + PacBio Manuscript).
I've collected the dereplication results from its 4 analyses below.
DADA2 + PacBio: ZymoBIOMICS Microbial Community Standard
Sample 1 - 69367 reads in 20516 unique sequences.
DADA2 + PacBio: HMP Mock Community
Sample 1 - 69963 reads in 19873 unique sequences.
DADA2 + PacBio: Fecal Samples
Sample 1 - 17147 reads in 4468 unique sequences.
Sample 2 - 26190 reads in 4958 unique sequences.
Sample 3 - 24081 reads in 5326 unique sequences.
Sample 4 - 16474 reads in 3668 unique sequences.
Sample 5 - 28171 reads in 6385 unique sequences.
Sample 6 - 24453 reads in 5529 unique sequences.
Sample 7 - 21411 reads in 5519 unique sequences.
Sample 8 - 18429 reads in 5152 unique sequences.
Sample 9 - 13258 reads in 3772 unique sequences.
DADA2 + PacBio: S. aureus from Wagner et al. 2016
Sample 1 - 4056 reads in 1457 unique sequences.
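Yes, your numbers seem high. In the datasets above, roughly 19% to 36% of reads are unique, compared with about 91% in your sample, so something could be wrong.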
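To make that comparison concrete, here is a minimal Python sketch that computes the unique/read fraction for each sample. The counts are copied from the results listed above and the last entry uses the approximate figures from the question; the `unique_fraction` helper and the sample labels are just names I made up for illustration.

```python
# Read and unique-sequence counts copied from the results listed above,
# stored as (total reads, unique sequences).
counts = {
    "Zymo mock, sample 1": (69367, 20516),
    "HMP mock, sample 1": (69963, 19873),
    "Fecal, sample 1": (17147, 4468),
    "Fecal, sample 2": (26190, 4958),
    "Fecal, sample 3": (24081, 5326),
    "Fecal, sample 4": (16474, 3668),
    "Fecal, sample 5": (28171, 6385),
    "Fecal, sample 6": (24453, 5529),
    "Fecal, sample 7": (21411, 5519),
    "Fecal, sample 8": (18429, 5152),
    "Fecal, sample 9": (13258, 3772),
    "S. aureus, sample 1": (4056, 1457),
    # Approximate figures quoted in the original question.
    "Question (approx.)": (12300, 11200),
}


def unique_fraction(reads, uniques):
    """Fraction of reads that are unique sequences after dereplication."""
    return uniques / reads


# Print one line per sample with the fraction shown as a percentage.
for name, (reads, uniques) in counts.items():
    print(f"{name:22s} {unique_fraction(reads, uniques):6.1%}")
```

Running the same calculation on each of your own samples should show quickly whether every sample is affected or only some of them.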
Good thinking! Of the causes you listed, leftover primers are guaranteed to make more of the reads unique. Maybe the primers (and any remaining adapters) could be cut off with cutadapt?
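Before trimming, it may help to check whether primers are still present and whether the reads are in mixed orientation. Below is a rough Python sketch of that check, not a replacement for cutadapt. It assumes the 27F/1492R primer pair commonly used for full-length 16S (your primers may differ, so swap in your own), it only accepts exact matches to the degenerate primer (real trimmers tolerate mismatches), and the function names and the 50 bp search window are hypothetical choices of mine. As far as I know, DADA2's own removePrimers() can also remove primers and re-orient reads.

```python
import gzip
import re
from collections import Counter

# ASSUMPTION: 27F / 1492R primers; replace with the primers actually used.
F27 = "AGRGTTYGATYMTGGCTCAG"
R1492 = "RGYTACCTTGTTACGACTT"

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a degenerate primer into an exact-match regex."""
    return re.compile("".join(IUPAC[base] for base in primer))


def classify(read, fwd=F27, rev=R1492, window=50):
    """Label a read 'forward', 'reverse', or 'unmatched'.

    Only the first and last `window` bases are searched for primers.
    """
    read = read.upper()
    head, tail = read[:window], read[-window:]
    if primer_regex(fwd).search(head) and primer_regex(revcomp(rev)).search(tail):
        return "forward"
    if primer_regex(rev).search(head) and primer_regex(revcomp(fwd)).search(tail):
        return "reverse"
    return "unmatched"


def fastq_sequences(path):
    """Yield the sequence line of each record in a FASTQ or FASTQ.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every 4-line record
                yield line.strip()


def tally(sequences, fwd=F27, rev=R1492):
    """Count how many reads fall into each orientation class."""
    return Counter(classify(seq, fwd, rev) for seq in sequences)


# Toy demonstration on three made-up reads (not real data).
insert = "ACGT" * 10
forward_read = "AGAGTTTGATCATGGCTCAG" + insert + revcomp("AGCTACCTTGTTACGACTT")
toy_reads = [forward_read, revcomp(forward_read), insert]
print(tally(toy_reads))
```

On real data, something like `tally(fastq_sequences("sample.fastq.gz"))` (a placeholder file name) would show the split. Many 'forward' and 'reverse' reads together suggest primers are still attached and orientation is mixed, while mostly 'unmatched' reads suggest primers were already removed or differ from the assumed pair.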
Good luck! Let us know what you try next!
Looks like this was cross-posted on the DADA2 GitHub as Issue #2182 in benjjneb/dada2: "PacBio full-length 16S CCS data: almost all reads unique during dereplication and very few reads retained after denoising"