Hi,
I am processing my PacBio 16S rRNA reads through denoise-ccs and, as you can see from the table in stats-dada2.qzv (1.2 MB), I am losing almost 60% of my reads at the denoising step. Could anyone help me improve this? I have tried changing --p-max-ee, but that does not help.
The first thing to check when you are losing more reads than expected is that all non-biological sequences have been removed from your reads. You might also want to watch these videos from one of our workshops, which cover both the theory of denoising and actually running the tools: Lecture & Plugin Tutorial.
However, in your case, I think it is likely due to something we are not yet aware of regarding how DADA2 processes CCS reads. Could you try removing --p-min-len and --p-max-len and see if that helps?
Thanks @Keegan-Evans, I have seen those videos; they are very good and informative. I tried without --p-min-len and --p-max-len, which did not change the outcome: I am still losing almost the same number of reads. I was wondering, doesn't removing these parameters just set them to their defaults? It looks like the dereplication step is where I am losing the most reads?
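One way to pin down which step accounts for the loss is to compute the drop between consecutive columns of the stats table. Below is a minimal sketch in Python. The counts are made up for illustration, and the step names are an assumption about how the denoising stats are laid out, so substitute the columns and values from your own table.

```python
# Hypothetical per-sample read counts; replace with values from your own stats table.
# The step names and their order are assumed, not taken from this thread.
counts = {
    "input": 10000,
    "primer-removed": 9000,
    "filtered": 8000,
    "denoised": 4200,
    "non-chimeric": 4000,
}


def step_losses(counts):
    """Return (step, reads_lost, percent_of_input_lost) for each consecutive step."""
    steps = list(counts.items())
    total = steps[0][1]  # reads at the first step, used as the denominator
    losses = []
    for (_, before), (name, after) in zip(steps, steps[1:]):
        lost = before - after
        losses.append((name, lost, 100.0 * lost / total))
    return losses


losses = step_losses(counts)
worst = max(losses, key=lambda row: row[1])  # step with the largest absolute drop

for name, lost, pct in losses:
    print(f"{name}: lost {lost} reads ({pct:.1f}% of input)")
print("largest drop at:", worst[0])
```

Reporting each loss as a percentage of the input, rather than of the previous step, keeps the numbers comparable across steps and across samples.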
@kumars,
Can you tell me more about the wet lab processing that was performed on your samples prior to sequencing? If you do not know this, do you know which PacBio machine your sequences were read on?
If no PCR was performed, it might be that DADA2 is seeing true, unique sequences that are very close to other sequences and eliminating them as chimeric. How much more abundant a potential "parent" sequence must be than the sequence being tested is set with the --p-min-fold-parent-over-abundance parameter, which defaults to 3.5. You might try decreasing it, even as low as 1, though you would probably want to review the literature before publishing, as I am not aware of any studies that evaluate the accuracy or reliability of this approach.
Overall, denoising long-read sequences is not nearly as well studied as denoising short-read sequences, so I am struggling to offer any "rules of thumb" or guidelines. I think the best approach at this point may be to try out various denoising parameters and, hopefully, share your results here so that the community can discuss them and we can learn from your efforts.
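If you do try out a range of parameters, it may help to generate the commands systematically so every run is named after its settings. The sketch below only builds and prints the command lines; it does not run anything. The file names, the placeholder primers, and the values other than 3.5 are hypothetical, and the option names are the ones I believe the QIIME 2 command line uses, so please confirm them with `qiime dada2 denoise-ccs --help` before running.

```python
import itertools
import shlex

# Hypothetical input artifact and placeholder primers; substitute your own.
INPUT = "demux.qza"
FRONT = "FORWARD_PRIMER"
ADAPTER = "REVERSE_PRIMER"

# Values to try. 3.5 is the default mentioned above; the others are arbitrary.
fold_values = [1.0, 2.0, 3.5]
max_ee_values = [2.0, 4.0]


def build_command(fold, max_ee):
    """Build one denoise-ccs command as an argument list, with outputs tagged by settings."""
    tag = f"fold{fold}_ee{max_ee}"
    return [
        "qiime", "dada2", "denoise-ccs",
        "--i-demultiplexed-seqs", INPUT,
        "--p-front", FRONT,
        "--p-adapter", ADAPTER,
        "--p-min-fold-parent-over-abundance", str(fold),
        "--p-max-ee", str(max_ee),
        "--o-table", f"table_{tag}.qza",
        "--o-representative-sequences", f"rep-seqs_{tag}.qza",
        "--o-denoising-stats", f"stats_{tag}.qza",
    ]


# One command per combination of the two parameters.
commands = [
    build_command(fold, max_ee)
    for fold, max_ee in itertools.product(fold_values, max_ee_values)
]

for cmd in commands:
    print(shlex.join(cmd))
```

Tagging each output with its parameter values makes it easier to line up the resulting stats tables side by side when posting results.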
It might also be worth allowing indel errors in the primer region. While this can end up being around 4x slower, it might be worth doing, as ~80% of errors in CCS reads are indels. To allow detection of primer sequences containing indel errors, you can use the --p-indels flag.
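To see why indels matter for primer detection, here is a toy illustration in Python. It is not how DADA2 or QIIME 2 matches primers; it only contrasts a position-by-position mismatch count with an edit distance that tolerates insertions and deletions. The primer is a made-up 20-base string, not a real 16S primer.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Made-up primer for illustration only.
primer = "ACGTTGCATGGCTAACGTCA"

# Simulated read: the primer with the base at index 3 deleted, followed by insert sequence.
read = primer[:3] + primer[4:] + "GATTACA"

# Mismatch-only comparison against the first len(primer) bases of the read.
window = read[:len(primer)]
print("mismatches without indels:", hamming(primer, window))

# Indel-aware comparison against the (one base shorter) primer region.
primer_region = read[:len(primer) - 1]
print("edit distance with indels:", edit_distance(primer, primer_region))
```

A single deleted base shifts everything after it, so a mismatch-only comparison sees many disagreements where an indel-aware comparison sees one error. That is the kind of read that would fail primer detection unless indels are allowed.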
@Keegan-Evans,
Prior to sequencing, PCR was performed to amplify the near-full-length 16S rRNA gene using the 27F and 1492R primer pair. The PacBio Sequel II sequencing platform was used.
I have tried another DADA2 run with --p-min-fold-parent-over-abundance 1. The outcome is still the same. Denoising full-length/long reads does indeed seem to require a bit more attention. Since I am removing the primers and processing only those sequences that have primers attached, I am not sure --p-indels would make much difference in this case. But again, I am only taking baby steps here. I can give it a go if you still think it is worth it.
I am also happy to share the raw sequences for you to have a look?