The newest generations of Illumina machines (including NovaSeq, in your case) use a new, more "streamlined" binning system for quality scores. Instead of the traditional continuous Phred scores you may be familiar with, the scores are binned into only 4 values, hence the very artificial-looking quality plots you see. This is totally expected. How these newer binned scores affect downstream quality control/denoising is a different discussion.
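If you want to confirm the binning for yourself rather than rely on the plots, one option is to count the distinct quality values in a raw FASTQ file. The snippet below is only a minimal sketch: the function names, the Phred+33 offset, the 4-lines-per-record layout, and the max_distinct cutoff are all my own assumptions for illustration, not anything QIIME 2 or Illumina provides.

```python
import gzip
from collections import Counter


def phred_value_counts(fastq_path, max_reads=10000, offset=33):
    """Count how often each Phred value occurs in the first max_reads records.

    Assumes unwrapped 4-line FASTQ records and Phred+33 encoding.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # the 4th line of each record is the quality string
                counts.update(ord(ch) - offset for ch in line.rstrip("\n"))
    return counts


def looks_binned(counts, max_distinct=8):
    """Heuristic only: very few distinct quality values suggests binning.

    The cutoff of 8 is an arbitrary, hypothetical threshold.
    """
    return len(counts) <= max_distinct
```

Calling phred_value_counts on one of your demultiplexed files and printing the sorted keys should show only a handful of distinct values if the scores are binned, whereas older MiSeq data would typically show a much wider spread.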
It looks like the DADA2 developers are going to release a new version that can specifically handle this new data type better (as per here), but this is not currently implemented in q2-dada2. I'm not sure how well or badly DADA2 would perform with the binned scores; my guess is that its error-model-building step may not do so well, but that is pure speculation on my part!
You could always try merging your reads with q2-vsearch and running the output through Deblur. Of note, the Deblur pre-packaged error model was based on Illumina MiSeq data, and as far as I'm aware it has never been benchmarked against NovaSeq data. My guess, however, is that it would work fine, because NovaSeq is meant to have more accurate base-calling than MiSeq, meaning that you may be taking a more conservative approach to your QC here.
Btw, the above suggestions are based on the assumption that you have amplicon data (i.e. 16S, ITS), and NOT shotgun data.
Hi @Mehrbod_Estaki ,
Thank you very much for your answer.
Yes, I have amplicon data (16S). As I understand from what you said, I should use Deblur instead of DADA2, is that correct?
In your opinion, wouldn't it be better if I used Trimmomatic instead and then, after QC, worked in QIIME?
Yes, merging with q2-vsearch and then running Deblur would be my current approach within the QIIME 2 environment. Once the new DADA2 release is out, I would probably switch to that, since at least it will have some benchmarks to back it up.
Another caveat you should be aware of is that Deblur tends to drop more reads as their length increases; see an example calculation here. So, if you find yourself losing too many reads after Deblur, consider using your forward reads only. You would lose some resolution but may retain many more reads. Ultimately, this depends on what your overall goals are with the analysis and what you prioritize.
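For a rough intuition of why longer reads tend to survive less often, here is a back-of-the-envelope sketch. To be clear, this is NOT the linked calculation and NOT how Deblur works internally; it simply assumes independent errors at a constant per-base rate (the 0.5% default is a made-up illustrative number) and asks what fraction of reads would contain no error at all at a given length.

```python
def expected_error_free_fraction(read_length, per_base_error=0.005):
    """Fraction of reads expected to contain zero errors.

    Assumes errors are independent and occur at a constant per-base rate,
    which is a simplification; the default rate is purely illustrative.
    """
    return (1.0 - per_base_error) ** read_length


if __name__ == "__main__":
    # Compare a few hypothetical trim lengths.
    for length in (100, 150, 250):
        fraction = expected_error_free_fraction(length)
        print(f"length {length}: {fraction:.3f} of reads expected error-free")
```

Under these assumptions the error-free fraction shrinks geometrically with length, which is the general reason a shorter trim length or forward-only reads can retain noticeably more data.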
Having never used Trimmomatic, I'm not sure in what ways it would be better than the approaches I mentioned above, or than q2-vsearch alone if you'd rather not use the denoisers. Can you expand on what your plan would be?
Thanks again for your time and all the clear explanations that you are giving.
The reason I was thinking of using Trimmomatic for QC is that, after importing the raw reads into QIIME, I am not able to read the Interactive Quality Plot properly in order to decide on thresholds for truncating and trimming the low-quality reads. I then realized that this is because I am dealing with NovaSeq raw reads.
So, I was thinking of cleaning the raw reads with Trimmomatic first, then importing the cleaned reads and performing the rest of the analyses in QIIME, if that makes any sense!
I'm still not sure I understand the purpose of using Trimmomatic first. Your plots look this way because the underlying quality scores are binned, so they would look basically the same with any other tool you use, and I'm not sure where the advantage is. Perhaps I don't know all the functions of Trimmomatic, but isn't its main purpose to trim reads? If so, trimming can be done within QIIME 2 using q2-cutadapt, or directly within either q2-deblur or q2-dada2 (whichever you choose to use). That would save you the extra step of transferring data between tools and keep your provenance complete right from the first step.
Thank you. I have been thinking over what you told me, and yes, now it is totally clear to me. I will not use Trimmomatic to trim the data. I will follow what you've said and do it in QIIME.