I have been analysing NovaSeq 6000 16S rRNA V4-V5 data for the past few months. This is my first experience with NGS data analysis. As a novice, I was blindsided by the fact that NovaSeq results behave very differently because of their binned quality scores. From multiple posts, the consensus seems to be that Dada2 is not appropriate for NovaSeq sequences unless you can enforce monotonicity.
This leads me to my question: has Qiime2 implemented a way to run the Dada2 denoising algorithm with monotonicity enforced? The alternatives all point to R. Lastly, would it be possible to add a note to the tutorials that include Dada2 in their analyses, stating that it is incompatible with binned quality scores (or specifically NovaSeq)?
Thank you for the continued great work with Qiime2!
Thank you for compiling the posts. I have seen and followed these suggestions. However, it is not clear from the Qiime2 tutorials (e.g. Moving Pictures and Parkinson's Mouse) that NovaSeq samples should be analysed differently.
Therefore, I suggest adding a note in these tutorials to highlight the fact that you can't use dada2 through Qiime2 to process your data. As a newcomer, this was very unclear and I think it would help a lot of users not fall into the same trap as I did.
There is no documentation in Qiime2 where anyone mentions that you can enforce monotonicity in Qiime2. For someone not aware of this issue, this could be problematic. When you follow a tutorial or read about the different analyses, it is not made clear anywhere on Qiime2 that the standard dada2 analysis in Qiime2 is not appropriate with NovaSeq data. No tutorial or documentation mentions this as a limitation of dada2.
Therefore, is there any way to add a note to the documentation that mentions this as a potential limitation? This will allow new users to immediately realise that dada2 in Qiime2 is not appropriate with NovaSeq data.
As far as I remember, there were several reports (I can't find all of them now, since I read them more than a year ago when I started working with NovaSeq, but here is one) that forcing monotonicity in the Dada2 error learning model did not result in great differences compared with running it as is on NovaSeq data. Also, some of the issues were not related to the dada2 error learning model. Here, here and here the developers recommend proceeding as is, since no drastic effects were found in mock community tests.
However, I would agree that some kind of warning should be added for users who are working with NovaSeq data, though I am not sure if that can be added to the tutorials. @Nicholas_Bokulich, what is your opinion regarding the issue in this topic? Should it be stated in the tutorials that NovaSeq data have some issues with current dada2 versions?
As far as I understand, the dada2 developers do not state that dada2 is inappropriate for NovaSeq data.
Thank you for elaborating on your previous post. This comprehensive answer makes a lot more sense to me. This is great news then! My apologies, it does then seem that dada2 should be appropriate to use with NovaSeq data in the cases where Unique Dual Indices were used.
Thank you kindly for your explanation regarding dada2 and NovaSeq!
In theory, once qiime2 uses a version of dada2 with this feature, it should just be a matter of passing the option through. What might be a nice feature is looking at the first X sequences of a dataset, counting how many different Q scores there are, and raising an error/warning if it looks like the scores have been binned.
EDIT: In my haste I missed that it is not implemented in dada2 itself but as a step in the ampliseq pipeline.
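For what it's worth, the "look at the first X sequences" check described above could be sketched roughly like this in Python. This is only an illustration and not anything that exists in qiime2 or dada2: the function names, the 10,000-record sample size and the cut-off of 8 distinct scores are all assumptions of mine, and it assumes plain 4-line FASTQ records with Phred+33 encoding.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode (hypothetical helper)."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def distinct_quality_scores(lines, max_records=10000, offset=33):
    """Collect the distinct Phred scores seen in the first `max_records` reads.

    Assumes strict 4-line FASTQ records, so every 4th line is a quality string.
    """
    scores = set()
    for i, line in enumerate(islice(lines, max_records * 4)):
        if i % 4 == 3:  # quality line of the record
            scores.update(ord(ch) - offset for ch in line.rstrip("\r\n"))
    return scores


def looks_binned(scores, max_distinct=8):
    """Heuristic only: few distinct scores suggests binning.

    The threshold of 8 is an arbitrary assumption, not a documented value.
    """
    return 0 < len(scores) <= max_distinct
```

Usage would be something like `looks_binned(distinct_quality_scores(open_fastq("sample_R1.fastq.gz")))`, with the file name being a placeholder. A warning rather than a hard error is probably the safer choice, since heavily trimmed or very short datasets could also show few distinct scores.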
I think you all make good points. On the one hand, @timanix has clearly referenced discussions in the dada2 github issue tracker etc. showing that enforcing monotonicity does not necessarily impact the outcome, and that binned quality scores are not necessarily incompatible with using dada2/q2-dada2 directly.
On the other hand, @Johanndb you make the good point that binned quality scores have been reported to lead to poor error models at least in some instances. So even if the problem is not frequent, it is still a problem that can be addressed by better oversight on the user's part. I agree that changing the docs to encourage researchers to do their due diligence is a good idea.
Note that the tutorials are all open-source and community developed, so @Johanndb you would be very welcome to submit a pull request to the QIIME 2 documentation on github if you would like to contribute a change. Otherwise, I think that the Q2 team will need to do some brainstorming before deciding exactly where/how to address this.
I wanted to add to this thread my perspective as the developer/maintainer of the DADA2 software package.
We think DADA2 works fine with binned-quality-score data such as NovaSeq, at least in the vast majority of cases. While there are some oddities when inspecting the error model that DADA2 learns from binned quality scores, there don't seem to be oddities when evaluating the results of denoising on NovaSeq (with the caveat that this is based on a relatively limited number of test samples). Furthermore, the oddities in the DADA2 error model with binned quality scores often appear more important than they are -- typically the error model is well-fit at the few binned quality scores that dominate the data, and the visual deviations that appear in diagnostic plots occur only at quality scores that barely appear in the data. That's a consequence of the simple but fairly robust loess fitting used to learn the DADA2 error model.
To sum up: As far as we can discern, DADA2 works fine with binned quality score data like NovaSeq. Undoubtedly it could be optimized further, but I can say that as the developer of DADA2 I use it on NovaSeq data without concern. That said, as always -- do your sanity checks!
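As one possible form of sanity check, the point above about deviations occurring only at rarely used quality scores can be illustrated with a small Python sketch. It is not part of DADA2 or q2-dada2: the function names are hypothetical, and it assumes you have already exported, for one transition type, a list of estimated error rates indexed by quality score plus a list of how many bases were observed at each score.

```python
def monotonicity_violations(error_rates):
    """Return the quality-score indices where the error rate goes UP
    relative to the previous (lower) quality score."""
    return [
        q for q in range(1, len(error_rates))
        if error_rates[q] > error_rates[q - 1]
    ]


def violation_weight(error_rates, counts):
    """Fraction of observed bases sitting at quality scores that violate
    monotonicity. `counts[q]` is the number of bases seen at score q."""
    total = sum(counts)
    if total == 0:
        return 0.0
    bad = monotonicity_violations(error_rates)
    return sum(counts[q] for q in bad) / total
```

With made-up numbers such as `error_rates = [0.1, 0.05, 0.06, 0.01]` and `counts = [10, 0, 1, 989]`, the only violation is at index 2, which carries 1 of 1000 bases. A small fraction like that would be consistent with a deviation that looks worse in the plot than it is in practice, while a large fraction would be a reason to look more closely.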
I am now dealing with NovaSeq data and all of the issues raised in this thread.
One way to deal with this in QIIME2 might be to provide the error model plots that can be obtained in R. This would make it easier to check for "problematic datasets".