NovaSeq and Dada2 incompatibility.

Johanndb · March 22, 2023, 8:24am

Dear All,

I have been analysing NovaSeq 6000 16S rRNA V4-V5 data for the past few months. This is my first experience with NGS data analysis. As a novice, I was sideswiped by the fact that NovaSeq results were very different due to their binned quality scores. From multiple posts, the consensus is that Dada2 is not appropriate for NovaSeq sequences unless you can enforce monotonicity.

This leads me to my question: has Qiime2 implemented a way to run the Dada2 denoising algorithm while enforcing monotonicity? The alternatives I have found all point to R. Lastly, would it be possible to add a note to the tutorials that include Dada2 in their analyses, stating that it is incompatible with binned quality scores (or specifically NovaSeq)?

Thank you for the continued great work with Qiime2!

Kind regards,

Johann

timanix · March 23, 2023, 8:07am

Hi!
Three options for how to proceed with NovaSeq data are listed here or here.

Best,

Johanndb · March 23, 2023, 8:43am

Dear @timanix

Thank you for compiling the posts. I have seen and followed these suggestions. However, it is not clear from the Qiime2 tutorials (e.g. Moving Pictures and Parkinson's Mouse) that NovaSeq samples should be analysed differently.

Therefore, I suggest adding a note in these tutorials to highlight the fact that you can't use dada2 through Qiime2 to process your data. As a newcomer, this was very unclear and I think it would help a lot of users not fall into the same trap as I did.

I hope I made this clear.

Kind regards,

Johann

timanix · March 23, 2023, 8:48am

I would not say that one cannot use dada2 as is with NovaSeq data. The data can still be processed with qiime2 and dada2.

Best,

Johanndb · March 23, 2023, 9:53am

Hi @timanix

I don't think you quite understand the point I am trying to make. Could you please elaborate on what you mentioned?

All the posts you link specify utilising R or Q2-dada2. Even then, there is no official Qiime2 document that specifies that you cannot use dada2 as is with NovaSeq data. There is also no note within the Qiime2 documentation highlighting that the data analysis for NovaSeq should be different when utilising dada2. Please see "Consequences of using dada2 on NovaSeq data" (Issue #791, benjjneb/dada2 on GitHub) and "Binned quality scores and their effect on (non-decreasing) trans rates" (Issue #1307, benjjneb/dada2 on GitHub), where the data scientists explain that you need to enforce monotonicity when utilising dada2 with NovaSeq.
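For readers unfamiliar with the term: as I understand it, "enforcing monotonicity" here means making the fitted error rate never increase as the quality score goes up. Below is a minimal Python sketch of one way to do that, using a running maximum taken from the highest quality score downwards. It only illustrates the idea; it is not the code from the linked issues or from DADA2, and `enforce_nonincreasing` is a hypothetical helper name.

```python
from typing import List


def enforce_nonincreasing(rates: List[float]) -> List[float]:
    """Return a copy of `rates` that never increases from left to right.

    `rates[i]` is assumed to be the estimated error rate at the i-th
    quality score, with quality scores sorted in ascending order.
    """
    out = list(rates)
    running_max = 0.0
    # Walk from the highest quality score down to the lowest, so that a
    # lower quality score never ends up with a smaller error rate than
    # any higher quality score.
    for i in range(len(out) - 1, -1, -1):
        running_max = max(running_max, out[i])
        out[i] = running_max
    return out


# The bump at the third position (0.02 > 0.01) is flattened out.
print(enforce_nonincreasing([0.1, 0.01, 0.02, 0.001]))
# [0.1, 0.02, 0.02, 0.001]
```

A running maximum is just one simple choice; other approaches (for example, clamping every rate to at least the rate at the highest quality score) would also give a monotone curve.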

There is no documentation in Qiime2 where anyone mentions that you can enforce monotonicity in Qiime2. For someone not aware of this issue, this could be problematic. When you follow a tutorial or read about the different analyses, it is not made clear anywhere on Qiime2 that the standard dada2 analysis in Qiime2 is not appropriate with NovaSeq data. No tutorial or documentation mentions this as a limitation of dada2.

Therefore, is there any way to add a note to the documentation that mentions this as a potential limitation? This will allow new users to immediately realise that dada2 in Qiime2 is not appropriate with NovaSeq data.

Please let me know if anything is unclear.

Kind regards,

Johann

timanix · March 23, 2023, 10:47am

As far as I remember, there were several reports (I can't find all of them now, as I read them more than a year ago when I started working with NovaSeq, but here is one) that forcing monotonicity in the Dada2 error-learning model did not result in great differences compared with running it as is on NovaSeq data. Also, some issues were not related to the dada2 error-learning model. Here, here and here the developers recommend proceeding as is, since no drastic effects were found in mock tests.

However, I would agree that some kind of warning should be given to users who are working with NovaSeq data, though I am not sure whether that can be added to the tutorials. @Nicholas_Bokulich, what is your opinion regarding the issue in this topic? Should it be stated in the tutorials that NovaSeq data have some issues with current dada2 versions?

As far as I understand, the dada2 developers do not state that dada2 is inappropriate for NovaSeq data.

Best,

Johanndb · March 23, 2023, 11:12am

Hi @timanix

Thank you for elaborating on your previous post. This comprehensive answer makes a lot more sense to me. This is great news then! My apologies, it does then seem that dada2 should be appropriate to use with NovaSeq data in the cases where Unique Dual Indices were used.

Thank you kindly for your explanation regarding dada2 and NovaSeq!

Kind regards,

Johann

Micro_Biologist · March 23, 2023, 11:21am

I'm pretty certain the current advice is to proceed with caution and, if you see oddities in the data, consider enforcing monotonicity. I'm not sure if it actually made it into a release, but it should be in the standalone version of dada2 (dada2 code optimization for binned quality scores from NovaSeq Data · Issue #425 · nf-core/ampliseq · GitHub).

In theory, once qiime2 uses a version of dada2 with this feature, it should be a matter of passing the option through. A nice feature might be to look at the first X sequences of a dataset, count how many different Q scores there are and, if the data look binned, raise an error/warning.
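To make that idea concrete, here is a minimal Python sketch of such a check. It is not part of QIIME 2 or DADA2: the function names, the 10,000-read default and the cut-off of 8 distinct scores are illustrative assumptions, and it assumes plain 4-line FASTQ records with Phred+33 encoding.

```python
from itertools import islice
from typing import Iterable, Set


def distinct_quality_scores(fastq_lines: Iterable[str],
                            max_reads: int = 10000,
                            phred_offset: int = 33) -> Set[int]:
    """Collect the distinct Phred scores seen in the first `max_reads` reads.

    Assumes 4-line FASTQ records (no wrapped sequence or quality lines).
    """
    scores: Set[int] = set()
    for i, line in enumerate(islice(fastq_lines, 4 * max_reads)):
        if i % 4 == 3:  # every fourth line is the quality string
            scores.update(ord(ch) - phred_offset for ch in line.strip())
    return scores


def looks_binned(scores: Set[int], max_distinct: int = 8) -> bool:
    """Heuristic (assumed cut-off): very few distinct scores suggests binning."""
    return 0 < len(scores) <= max_distinct


# Tiny in-memory example; a real file could be passed as
# gzip.open(path, "rt"), since file objects iterate line by line.
example = ["@read1", "ACGT", "+", "FF-#", "@read2", "ACGT", "+", "F8FF"]
scores = distinct_quality_scores(example)
print(sorted(scores), looks_binned(scores))  # [2, 12, 23, 37] True
```

A check like this can only hint that the scores are binned; it says nothing about whether the resulting error model is actually problematic.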

EDIT: In my haste I missed that it is not implemented in dada2 itself but as a step in the ampliseq pipeline.

Nicholas_Bokulich · March 23, 2023, 12:56pm

Hello, thanks for pinging me in, @timanix,

I think you all make good points. On the one hand, @timanix has clearly referenced discussions in the dada2 GitHub issue tracker and elsewhere showing that enforcing monotonicity does not necessarily impact the outcome, and that binned quality scores are not necessarily incompatible with direct use of dada2/q2-dada2.

On the other hand, @Johanndb, you make the good point that binned quality scores have been reported to lead to poor error models, at least in some instances. So even if the problem is not frequent, it is still a problem that can be addressed by better oversight on the user's part. I agree that changing the docs to encourage researchers to do their due diligence is a good idea.

Note that the tutorials are all open-source and community developed, so @Johanndb, you would be very welcome to submit a pull request to the QIIME 2 documentation on GitHub if you would like to contribute a change. Otherwise, I think that the Q2 team will need to do some brainstorming before deciding exactly where/how to address this.

Thanks for the fruitful discussion everyone.

benjjneb · March 25, 2023, 2:18am

I wanted to add to this thread my perspective as developer/maintainer of the DADA2 software package.

We think DADA2 works fine with binned-quality-score data such as NovaSeq, at least in the vast majority of cases. While there are some oddities when inspecting the error model that DADA2 learns from binned quality scores, there don't seem to be oddities when evaluating the results of denoising on NovaSeq (with the caveat that this is based on a relatively limited number of test samples). Furthermore, the oddities in the DADA2 error model with binned quality scores often appear more important than they are -- typically the error model is well-fit at the few binned quality scores that dominate the data, and the visual deviations that appear in diagnostic plots occur only at quality scores that barely appear in the data. That's a consequence of the simple but fairly robust loess fitting used to learn the DADA2 error model.

To sum up: As far as we can discern, DADA2 works fine with binned quality score data like NovaSeq. Undoubtedly it could be optimized further, but I can say that as the developer of DADA2 I use it on NovaSeq data without concern. That said, as always -- do your sanity checks!

asbarros · August 6, 2024, 3:21pm

Hey everyone,

I am now dealing with NovaSeq data and all of the issues raised in this thread.

One way to deal with this in QIIME2 might be to provide the error model plots that are obtained from R. These would make it easier to check for "problematic datasets".

Thanks in advance!