llenzi
(Luca)
February 26, 2021, 11:22am
7
Hi @timanix ,
I think it is a good question!
Last I read on this was:
a GitHub issue, opened 07:11PM - 14 Jun 19 UTC (labels: enhancement, Priority):
Hello,
We have an amplicon dataset from a NovaSeq run and are exploring how we might alter settings in the dada2 pipeline to effectively identify errors in our data. In case you are unfamiliar, NovaSeq generates up to 10 billion reads per flow cell. One of the ways Illumina deals with storing the massive amount of data generated by the NovaSeq is to simplify the quality scores by binning the 40 possible values into just 4 categories, which vastly reduces the amount of information dada2 can work from to infer errors in the data.
Furthermore, the quality-score conversions are as follows:
0-2 -> 2
3-14 -> 12
15-30 -> 23
31-40 -> 37
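The binning above can be sketched as a simple lookup. This is a Python illustration built only from the bin boundaries quoted in the post, not dada2 code:

```python
def bin_novaseq_quality(q):
    """Map a raw Phred quality score (0-40) to its NovaSeq bin,
    using the boundaries quoted above."""
    if q <= 2:
        return 2
    elif q <= 14:
        return 12
    elif q <= 30:
        return 23
    else:
        return 37

# Only four distinct quality values survive the binning.
print(sorted({bin_novaseq_quality(q) for q in range(41)}))  # [2, 12, 23, 37]
```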
So in some cases the error is overestimated by the conversion (e.g. a score of 30 is relabelled 23) and in other cases it is underestimated (e.g. a score of 31 is relabelled 37).
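The over- and underestimation can be quantified with the standard Phred definition, p = 10^(-Q/10). A quick illustration in plain Python, using nothing beyond that definition:

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to its implied per-base error probability."""
    return 10 ** (-q / 10)

# A base with true Q30 is stored as Q23: its implied error probability
# jumps from 0.001 to roughly 0.005, so the error is overestimated ~5x.
print(phred_to_error_prob(30), phred_to_error_prob(23))

# A base with true Q31 is stored as Q37: its implied error probability
# drops from roughly 0.0008 to 0.0002, so the error is underestimated ~4x.
print(phred_to_error_prob(31), phred_to_error_prob(37))
```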
I see two main places where this "binned" quality score has consequences: the quality filtering and the error-rate learning step. I'm less worried about the quality filtering, as its settings are easy to adjust, but I was wondering whether you have suggestions on ways we might alter the parameters of ```learnErrors``` to better estimate NovaSeq error rates.
The first problem we encountered was the nbases parameter. NovaSeq runs are so large that with nbases set to 1x10^8 (our usual default) only one sample was being used to judge error rates. Do you have any recommendations for the minimum number of samples that should be used as the basis for error-learning?
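Why only one sample gets used can be sketched as follows: error learning of this kind accumulates whole samples until the running base total first reaches the nbases threshold, so a single NovaSeq-scale sample can satisfy the default on its own. The sample sizes below are invented for illustration:

```python
def samples_used(bases_per_sample, nbases=int(1e8)):
    """Count how many samples an nbases-style accumulation would read
    before the cumulative base total first reaches `nbases`."""
    total = 0
    for i, bases in enumerate(bases_per_sample, start=1):
        total += bases
        if total >= nbases:
            return i
    return len(bases_per_sample)

# Hypothetical NovaSeq-scale samples of ~1.5e8 bases each: the first
# sample alone exceeds the 1e8 default, so only one sample is used.
print(samples_used([150_000_000] * 10))             # 1
# Raising nbases (e.g. to 1e9) pulls in more samples.
print(samples_used([150_000_000] * 10, int(1e9)))   # 7
```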
The second issue is the error estimation itself. When we run the ```learnErrors``` command on both our real NovaSeq data and simulated NovaSeq data (MiSeq data that we converted to have NovaSeq-style binned errors) we see a pretty characteristic error plot.
Simulated data: [error-rate plot]
Real NovaSeq data: [error-rate plot]
Pretty consistently, the error plots mis-estimate the error frequency in certain ranges of the quality-score landscape. In particular, they underestimate it in the 30-40 range (the fitted models show a consistent "dip" in this region) and vastly overestimate it in some parts of the 10-25 range. Do you have any recommendations about changes we might make to our analysis pipeline to improve the error estimation at this step?
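One workaround often discussed in the dada2 community for that "dip" is to force the fitted error rates to be monotonically non-increasing as quality increases (in dada2 this is done by supplying a modified error-estimation function in R). The monotonicity step itself amounts to a running minimum; a minimal Python sketch of just that step, with illustrative names and data:

```python
def enforce_monotonic(error_rates):
    """Given error rates indexed by increasing quality score, clamp each
    value with a running minimum so the rate never rises as quality rises."""
    out = []
    current = float("inf")
    for rate in error_rates:
        current = min(current, rate)
        out.append(current)
    return out

# A fitted curve that dips and then rises again at high quality scores;
# clamping removes the rise, so higher quality never implies more error.
rates = [0.05, 0.02, 0.004, 0.0005, 0.002, 0.003]
print(enforce_monotonic(rates))
# [0.05, 0.02, 0.004, 0.0005, 0.0005, 0.0005]
```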
Thanks so much!
Hannah
I am not sure whether the most recent dada2 releases address the issue, or whether any of those changes have made it into qiime2.
Maybe @benjjneb could help clarify this?
Still, we have now gone astray from the initial question, so if there are more questions we had better create a new topic, for the sake of the forum!
Cheers
Luca