My data are 2 x 150 bp demultiplexed paired-end NovaSeq reads that have already been filtered by the sequencing centre: reads containing PhiX control signals were removed, and reads containing (partial) adapters were clipped (down to a minimum read length of 50 bp).
This pre-filtering has left approximately 25% of the reads between 50 bp and 135 bp long.
Additionally, after searching the forum, I found that the demux plot of my data is similar to those of others with NovaSeq data (see "Odd display of demux plot" and "Interpretation of demux.qzv"), where the quality scores have been binned at 2, 11, 25 and 37.
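One way to confirm the binning outside the demux plot is to tally the distinct quality scores in the FASTQ files. The snippet below is a minimal sketch, assuming standard four-line FASTQ records with Phred+33 encoding. The function name and the toy records are hypothetical and only illustrate the idea.

```python
from collections import Counter

def quality_score_counts(fastq_lines, offset=33):
    """Count Phred quality scores across all reads in FASTQ text lines."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # every fourth line holds the quality string
            counts.update(ord(ch) - offset for ch in line.strip())
    return counts

# Toy records (hypothetical): '#', ',', ':' and 'F' encode Q2, Q11, Q25 and Q37.
example = [
    "@read1", "ACGTAC", "+", "FFFF:,",
    "@read2", "ACGT", "+", "FF:#",
]
counts = quality_score_counts(example)
print(sorted(counts))  # distinct quality scores seen in the toy data
```

If real data show only a handful of distinct scores, the binning described above is present in the files themselves and is not a display artefact.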
The issue has also been raised here, with benjjneb recommending enforced monotonicity (if using DADA2 denoising). I'm also aware that DADA2 doesn't have an "official “production” solution for NovaSeq data yet". Thus, I am wondering what the best approach is for denoising.
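To make "enforced monotonicity" concrete: the idea is that the estimated error rate should never go up as the quality score goes up. The sketch below is an illustration in Python only. It is not DADA2's code, and it is not necessarily the exact modification recommended in the linked thread (in DADA2 this would be done in R, inside a custom error-estimation function). The function name and the example rates are hypothetical.

```python
def enforce_monotonic(error_rates):
    """Make error rates non-increasing as the quality score rises.

    error_rates[i] is the estimated error rate at the i-th quality score,
    ordered from lowest to highest quality.
    """
    adjusted = list(error_rates)
    # Walk from the highest quality score down, carrying the running maximum,
    # so a lower quality score never gets a lower error rate than a higher one.
    for i in range(len(adjusted) - 2, -1, -1):
        adjusted[i] = max(adjusted[i], adjusted[i + 1])
    return adjusted

# Hypothetical fitted rates for the four binned scores Q2, Q11, Q25, Q37.
fitted = [0.2, 0.01, 0.03, 0.001]
monotonic = enforce_monotonic(fitted)
print(monotonic)
```

In this toy case the fitted rate at Q11 is lower than the rate at Q25, which is implausible, so it is raised to match the Q25 rate.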
Regarding the variation in read length, what's the best approach? Do I set the forward and reverse truncation parameters to 0 (because the overall quality is good)? But then DADA2 requires pretty much uniform read lengths, right? What about the approach in "Filtering out ASVs from DADA2 based on length - #4 by thermokarst"?
Looking at your demultiplexed data, it appears to me that you may have already had some quality filtering applied. (I have worked less with NovaSeq, but this doesn’t look the way I’d normally expect an Illumina quality profile to look.) Based on that assumption, I would recommend using Deblur. I’m not sure it’s been benchmarked for NovaSeq either, which is a challenge.
Okay, so I just got an update from the brilliant @Nicholas_Bokulich. The error modeling/Phred scoring changed between the MiSeq/HiSeq and the NovaSeq, compressing that error space. I still think Deblur is probably an easier solution than shoehorning DADA2, but it looks like you’ve done a fair bit of background reading already if you want to do it in R and import into QIIME.
Hi @jwdebelius. Ok fantastic! Thanks for your help. I will push ahead with deblur and maybe try the DADA2 approach as well.
Any suggestions on how to deal with the variation in amplicon read length in terms of truncation parameters? Would it be best to set p-trunc-len to 0 (good quality overall) for both forward and reverse in order to include the shorter reads that had adapters/PhiX clipped?
I think it’s reasonable to try for DADA2. Run it that way and see how the data look. If you lose a lot of reads in quality filtering/denoising, maybe consider a shorter length.
For Deblur, you need to set a truncation length because it’s part of the algorithm.
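Because a fixed truncation length drops any read shorter than that length, it can help to estimate how many reads survive at a few candidate values before choosing one. This is a rough sketch, assuming reads shorter than the trim length are discarded. The function name and the length distribution are hypothetical, loosely modelled on the "approx 25% of reads between 50 bp and 135 bp" described earlier in the thread.

```python
def fraction_retained(read_lengths, trim_length):
    """Fraction of reads at least trim_length long (shorter reads are dropped)."""
    if not read_lengths:
        return 0.0
    kept = sum(1 for n in read_lengths if n >= trim_length)
    return kept / len(read_lengths)

# Hypothetical lengths: 75 full-length reads plus 25 clipped reads.
lengths = [150] * 75 + [50, 60, 70, 80, 90] * 5
for trim in (50, 90, 150):
    print(trim, fraction_retained(lengths, trim))
```

A shorter trim length keeps more of the clipped reads but gives less sequence per read, so the choice is a trade-off between read depth and resolution.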