I'm processing our first 16S amplicon seq data from Element AVITI and experiencing problems in dada2 denoising. Based on phred scores, the read quality is great, much better than we usually get from MiSeq. But if we truncate the reads based on phred scores like we would for MiSeq data (even >30 at 25th percentile gives longer reads than MiSeq usually at >20), most reads are discarded already at the filtering phase (like, >80%!). If we do much harsher truncation, most reads pass the filter, but majority is discarded later as chimeric, and we end up with something like 30% remaining.
Basically I still have enough data to work with, as the initial read counts are huge (>1 million!), but I'm worried that something is wrong.
I imported the data as PairedEndFastqManifestPhred33V2:
I now did a few tests with a set of 5 samples, with various truncations. It seems it's (again!) about the reverse reads, but I'd like to understand why they fail if we use them up to ~253 even though the fastq quality is >25 (25th percentile).
ZYMO is the ZymoBiomics microbial community standard so super simple "microbiota". COWFECAL and SF are fecal microbiota so high biodiversity. The others are probably somewhere in between.
Keep in mind that reads shorter than the truncation length are discarded. This is probably why you don't get great results with the first attempt of forward truncation at 307 -- the median forward read length is 306 (this information is below the quality plots in the demux visualizer). You tried forward 299 reverse 222 which did pretty well (at least as far through the pipeline as the merging step), so it's also worth trying around 300 forward and 256 reverse to have one run that maximizes both lengths.
It seems it's (again!) about the reverse reads, but I'd like to understand why they fail if we use them up to ~253 even though the fastq quality is >25 (25th percentile).
Why do you say this? From looking at the dada2 stats visualization, the step that seems to be filtering the most reads is chimera removal. This can happen if synthetic sequences (primers, barcodes) aren't removed from the reads. What quality control steps were performed upstream?
Ah, indeed, thanks, I was being stupid! I was just eyeballing the phred quality plot and forgot to check the read lengths Perhaps there indeed isn’t anything very strange going on.
I eyeballed a subset of sequences and verified that everything else except the 16S specific parts of the primers had been removed at the sequencing core. I then removed those using CutAdapt and checked the whole data using zgrep (and again a subset by eyeballing) that it was succesful. So that is not the reason for high proportion of chimeras. I wonder if that could be due to too much PCR amplification - although we only used 14 cycles in the 1st round PCR; the products were perhaps a tad stronger than necessary on gel.
I didn’t personally check the quality with fastqc/multiqc/etc because I presume that was done previously, but perhaps I should do it.
I'm not sure how much more tools like fastqc will tell you given you're confident that the primers have been trimmed and you can see the quality information in the demux visualizer, but it couldn't hurt.
I'm still wondering if the AVITI phred scores are a tad optimistic compared to MiSeq. It seems more sequences are accepted in dada2 (default settings) if I truncate closer to the theoretical minimum (towards the minimal overlap). I've so far got best results using 250/204, although FW phred is >40 all the way to >300 and also REV is almost phred 30 until about 250 ( (25th percentiles).
I thought longer overlap would be better but do I get something wrong here?
I'm not familiar with the AVITI platform. As far as getting better results with shorter truncations, this is likely not because the overlap region is smaller but because there are fewer lower quality bases to cause reads to be removed wholesale--the reads are being lost at the filtering step not the merging step.
Ultimately I would just move forward with the truncation parameters that give you the most retained reads.
I'm still wondering about the phred scores in the different sequencing instruments. Viewed in demux.qzv, MiSeq tops at 38, AVITI at 45. It seems we need to truncate AVITI reads to phred >35 to get decent dada2 results, whereas MiSeq is OK down to 20-25, sometimes even 15 if we must (talking about 25th percentiles). Is this typical to different instruments or am I getting something wrong?
Unless the MiSeq data you're referring to used the same region, primers, library prep, etc. there are probably too many other variables to be able to make that comparison. AVITI does use a different sequencing technology than Illumina so there's a different algorithm to come up with the quality scores. Of course a probability is a probability so the scores should have identical interpretations between technologies.
Sorry, I should have mentioned that this indeed was the same region, same primers, same library prep (as far as it goes for these different instruments). And basically the same type of biological sample material and the same DNA extraction method - but not the exact same templates as this was not a systematic comparison.
But yeah, you're of course right - I guess I basically meant to ask if the apparently quite differently set phred scores (despite the same phred format) could be a problem for dada2, or does it matter at all (except for the human being making the decision on truncate positions)?
I got a potentially interesting comment from AVITI bioinformatician. They think the phred scoring should be quite comparable vs MiSeq BUT: ”There are differences in the distribution of q scores within a read, however, which would be relevant to DADA2 filtering. AVITI data will have greater q score variance within a read--in AVITI data you would be more likely to find a single low q score in the middle of a high quality read.”
Therefore, they recommend we try tweaking maxEE. Does this sound like a good idea? How should we evaluate/validate the results?
We have a huge number of reads to begin with so that’s no problem, and by truncating close to minimum required overlap we get comparable % of reads through vs. MiSeq. But it feels stupid to throw away high quality data just because I don’t completely understand what’s going on.
That's interesting. It makes sense that that could be causing problems given that dada2 was developed for Illumina sequencing data. At face value, raising maxEE seems reasonable given the greater quality variance. However I don't have the deepest understanding of the dada2 algorithm and can't guarantee that this is the best approach. I would recommend opening an issue on the dada2 GitHub, the developer is pretty responsive to questions and would probably be interested in or already knowledgable of dada2's applicability to the avidity sequencing technology.