Sequence length increase after paired-end merge?

ynano · February 26, 2021, 4:51am

Hi all,
I received separate paired-end FASTQ files for 20 samples (10 WT and 10 KO), so I used the “Fastq manifest” format to import them into qiime2 v2020.8. After I ran "qiime demux summarize", I saw the quality plot below:

I didn't see any boxplots, and I'm not sure how much I should trim the reads. So I kept everything and used 0 for both --p-trunc-len-f and --p-trunc-len-r in the dada2 denoise-paired step. This is the sequence length I got in rep-seqs.qza:

It seems that before merging, the reads are 250 nt long at both ends, but after merging I got a mean length of 300. How is that possible? Should I trim my reads? Is there something wrong in the quality plot? Thank you so much for your time!

timanix · February 26, 2021, 8:47am

Hi!
Which rRNA region did you target in the PCR step?
It is expected that joined reads are longer than the forward and reverse reads individually, since the two reads are merged along their overlapping region.
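As a rough illustration of that arithmetic (a minimal sketch, not anything QIIME 2 or dada2 exposes; the function names are made up here): the merged length is the sum of the two read lengths minus the overlap, so the overlap can be inferred from an observed merged length.

```python
def merged_length(fwd_len: int, rev_len: int, overlap: int) -> int:
    """Length of a read pair after merging on `overlap` shared bases."""
    if overlap < 0 or overlap > min(fwd_len, rev_len):
        raise ValueError("overlap must be between 0 and the shorter read length")
    return fwd_len + rev_len - overlap


def implied_overlap(fwd_len: int, rev_len: int, merged_len: int) -> int:
    """Overlap implied by an observed merged length."""
    return fwd_len + rev_len - merged_len


# Numbers from the question: 250 nt reads, merged sequences of about 300 nt.
print(merged_length(250, 250, 200))    # 300
print(implied_overlap(250, 250, 300))  # 200
```

With the numbers from the question, a mean merged length of 300 would correspond to an overlap of about 200 nt between the two 250 nt reads.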

ynano · February 26, 2021, 9:57am

Thank you for your answer. I got the data from my collaborator and I was told that the hypervariable V3 region of 16S rDNA was amplified and sequenced using an Illumina second-generation sequencing platform. Does the pattern in the quality plot make sense? I'm not sure why it looks 'weird', as the other quality plots I have seen usually show quality scores decreasing towards the end of the reads.

timanix · February 26, 2021, 10:08am

They look nice to me. Check the merging stats: how many reads were successfully merged and retained for the analysis. If you think you lost too many, try applying trimming and check again. You should know the approximate size of the amplicons and only trim the reads in a way that leaves at least 20 nt of overlap between the forward and reverse reads.
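A minimal sketch of that check, assuming you already know the approximate amplicon length as it appears in the reads. The helper names and the 300 nt amplicon are hypothetical; only the 20 nt threshold comes from the advice above.

```python
MIN_OVERLAP = 20  # nt, the figure suggested in this thread


def overlap_after_truncation(amplicon_len: int, trunc_f: int, trunc_r: int) -> int:
    """Bases shared by the forward and reverse reads after truncation.

    A negative value means the truncated reads no longer reach each other.
    """
    return trunc_f + trunc_r - amplicon_len


def truncation_ok(amplicon_len: int, trunc_f: int, trunc_r: int,
                  min_overlap: int = MIN_OVERLAP) -> bool:
    """True if the truncated reads still overlap by at least `min_overlap`."""
    return overlap_after_truncation(amplicon_len, trunc_f, trunc_r) >= min_overlap


# Hypothetical 300 nt amplicon, two candidate truncation settings.
print(truncation_ok(300, 200, 150))  # True  (50 nt of overlap)
print(truncation_ok(300, 160, 150))  # False (only 10 nt of overlap)
```

The truncation lengths here stand in for whatever you would pass to --p-trunc-len-f and --p-trunc-len-r; the real amplicon length varies between taxa, so it is safer to leave some margin above the threshold.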

llenzi · February 26, 2021, 10:45am

Hi @ynano,

just to add a note to the excellent answer from @timanix, on your box plot looking 'weird'. What it is showing is that your quality scores are so close together that each box plot collapses to a horizontal line. That potentially means one of two things: (a) your sequences are really good; or (b) the quality scores were transformed. Case (b) is not really unusual. It may happen at some sequencing providers or even upstream: the newest Illumina machines (NovaSeq) bin the quality scores by default, which tends to produce plots that look like yours. So I suggest going back to your collaborator to ask for more information on the primers used (which will give you the expected amplicon size) and the exact Illumina machine used to generate those sequences.
If the machine used is either a MiSeq or a HiSeq, double check whether the facility performed any quality transformation on the data; if not, you are a lucky one and good to go.
If the quality scores were transformed or the machine used is a NovaSeq (or you are in doubt), I would suggest using deblur for denoising instead of dada2. The reason is that dada2 needs unchanged quality scores for its processes, while deblur does not rely on quality scores!
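One quick way to get a hint before hearing back from the facility is to count which quality values actually occur in a FASTQ file. The sketch below is only a heuristic and rests on assumptions: Phred+33 encoding, exactly four lines per record, and a made-up example record. Seeing only a handful of distinct values suggests binning, but does not prove it.

```python
import io
from collections import Counter


def quality_score_counts(handle, offset: int = 33) -> Counter:
    """Count Phred scores in a FASTQ stream (assumes 4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 3:  # the fourth line of each record holds the quality string
            counts.update(ord(ch) - offset for ch in line.rstrip("\n"))
    return counts


# Tiny made-up record for illustration only. For a real file you could pass
# a handle such as gzip.open(path, "rt") instead.
example = io.StringIO("@read1\nACGTACGT\n+\nFFFF--8#\n")
counts = quality_score_counts(example)
print(sorted(counts))  # [2, 12, 23, 37]
```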
Hope it helps
Luca

timanix · February 26, 2021, 10:56am

Hi @llenzi!
Thank you for the clarification.
So, to process reads from NovaSeq, would you recommend qiime deblur denoise-16S? In my case, I have paired reads (and so does @ynano). Should we use qiime vsearch join-pairs to merge the reads before deblur, or is there another way to denoise paired NovaSeq reads?

llenzi · February 26, 2021, 11:22am

Hi @timanix,
I think that is a good question!
Last I read on this was:

and

github.com/benjjneb/dada2

Consequences of using dada2 on NovaSeq data

opened 07:11PM - 14 Jun 19 UTC

hhollandmoritz

enhancement Priority

Hello, We have an amplicon dataset from a NovaSeq run and are exploring how …we might alter settings in the dada2 pipeline to effectively identify errors in our data. In case you are unfamiliar, NovaSeq generates up to 10 billion reads per flow cell and one of the ways Illumina deals with storing the massive amount of data generated by the NovaSeq is to simplify the error rates by binning the 40 possible quality scores into just 4 categories which vastly reduces the amount of information dada2 can work off of to infer errors in the data. Furthermore, the error-rate conversions are as follows:

- 0-2 -> 2
- 3-14 -> 12
- 15-30 -> 23
- 31-40 -> 37

So in some cases, error is being overestimated by the conversion (e.g. a score of 30 which is labelled 23) and in other cases it is being underestimated (e.g. a score of 31 being labelled 37). I see there as being two main places that this "binned" quality score has consequences, the quality filtering and the error-rate learning step. I'm less worried about the quality filtering as that is pretty easy to adjust the settings on, but I was wondering if you have suggestions about the ways we might alter the parameters of `learnErrors` to better estimate NovaSeq error rates.

The first problem we encountered was the nbases parameter. NovaSeq runs are so large that with nbases set to 1x10^8 (our usual default) only one sample was being used to judge error rates. Do you have any recommendations for the minimum number of samples that should be used as the basis for error-learning?

The second issue is the error estimation itself. When we run the `learnErrors` command on both our real NovaSeq data and simulated NovaSeq data (MiSeq data that we converted to have NovaSeq-style binned errors) we see a pretty characteristic error plot.

Simulated data: ![simulated_NovaSeq_errR_plot](https://user-images.githubusercontent.com/7916220/59531929-8c3b7b80-8ea4-11e9-9b19-9f82c92d086b.png)

Real NovaSeq data: ![NovaSeq_errR_plot2](https://user-images.githubusercontent.com/7916220/59531977-aa08e080-8ea4-11e9-907a-6f98f87b9d28.png)

Pretty consistently, error plots underestimate the error frequency in certain ranges of the quality score landscape. In particular, they underestimate it in the 30-40 range (error plot models show a consistent "dip" in this region) and vastly over-estimate it in some parts of the 10-25 range. Do you have any recommendations about changes we might make to our analysis pipeline to improve the error estimation at this step? Thanks so much! Hannah
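The binning table in the quoted issue can be written out as a small function. This is only an illustration of the mapping as stated there; `bin_quality` and `error_probability` are made-up names, not part of dada2 or any Illumina software.

```python
def bin_quality(q: int) -> int:
    """Map a Phred score (0-40) to its bin, per the table in the quoted issue."""
    if not 0 <= q <= 40:
        raise ValueError("expected a Phred score between 0 and 40")
    if q <= 2:
        return 2
    if q <= 14:
        return 12
    if q <= 30:
        return 23
    return 37


def error_probability(q: int) -> float:
    """Standard Phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Neighbouring scores 30 and 31 land in different bins.
print([bin_quality(q) for q in (30, 31)])        # [23, 37]

# A base called at Q30 is reported as Q23, so its stated error rate rises.
print(f"{error_probability(30):.4f}")            # 0.0010
print(f"{error_probability(bin_quality(30)):.4f}")  # 0.0050
```

This makes the over- and under-estimation mentioned in the issue concrete: Q30 is reported with roughly five times its original error probability, while Q31 is reported as the more optimistic Q37.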

I am not sure whether the most recent dada2 releases address the issue, or whether any of them have been released into qiime2.

Maybe @benjjneb could help clarify this?
Still, we are now going astray from the initial question, so I suppose that if there are more questions we had better create a new topic for the sake of the forum!
Cheers
Luca

ynano · February 26, 2021, 7:35pm

Hi @llenzi,

This is very helpful, I will check more details with my collaborator. Thank you so much!

ynano · February 26, 2021, 8:11pm

The percentage of reads merged ranges from 82 to 88 across samples, so I thought it was OK without any trimming. But I will first check for more information about the sequencing steps. Thank you again for your help!
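For reference, the percentage here is simply merged reads divided by input reads per sample. A minimal sketch with hypothetical counts (the sample names and numbers below are invented, chosen only to match the 82 to 88 range):

```python
def percent_merged(input_reads: int, merged_reads: int) -> float:
    """Share of input read pairs that were successfully merged, in percent."""
    if input_reads <= 0:
        raise ValueError("input_reads must be positive")
    return 100.0 * merged_reads / input_reads


# Hypothetical (input, merged) counts per sample, not real data.
stats = {"WT1": (50000, 44000), "KO1": (48000, 39360)}
for sample, (n_in, n_merged) in stats.items():
    print(sample, round(percent_merged(n_in, n_merged), 1))
# WT1 88.0
# KO1 82.0
```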

system · March 30, 2021, 2:11am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.