Quality plot and trimming for DADA2

rscotti · October 26, 2022, 4:18pm

Hi everyone,

I'm experiencing some issues with the quality control step.
This is the first time we have used a company (Novogene) for amplicon sequencing (16S, ITS and archaea). They provided us with the raw sequences (with barcodes and primers) and the "clean" sequences (with barcodes and primers removed). I have imported and processed both.
The fastq files weren't in Casava format, so I imported them manually with a manifest file.
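For anyone in the same situation, here is a minimal Python sketch of building a paired-end manifest. The file naming scheme (`<sample-id>_1.fq.gz` / `<sample-id>_2.fq.gz`) and the function name `build_manifest` are assumptions for illustration, so adjust the pattern to match the files you actually received.

```python
import csv
import re
from pathlib import Path

# ASSUMPTION (hypothetical naming): <sample-id>_1.fq.gz and <sample-id>_2.fq.gz
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.(fq|fastq)\.gz$")


def build_manifest(fastq_dir, manifest_path):
    """Write a tab-separated paired-end manifest; return the sample IDs written."""
    pairs = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = READ_PATTERN.match(path.name)
        if match:
            # The manifest needs absolute paths, so resolve each file.
            pairs.setdefault(match["sample"], {})[match["read"]] = path.resolve()

    written = []
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(
            ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
        )
        for sample, reads in sorted(pairs.items()):
            if "1" in reads and "2" in reads:  # skip samples missing a mate file
                writer.writerow([sample, reads["1"], reads["2"]])
                written.append(sample)
    return written
```

The resulting file should be importable with `qiime tools import` using `--type 'SampleData[PairedEndSequencesWithQuality]'` and `--input-format PairedEndFastqManifestPhred33V2`, assuming the quality scores are Phred+33 encoded.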
After importing them, I had a look at the imported data to check the quality and decide the trimming length for the subsequent denoising step, but they look like this...
These are the raw ones (including primer and barcode):

and these are the "clean" ones, without primers and barcodes:

First: they are shorter than usual, only 250 bp instead of 300 bp, so they probably didn't use a MiSeq platform.

Second: the quality seems really high, right? Normally I observe a drop in quality at the beginning or the end of the reads, but here there isn't one. Is that possible?

Third: should I trim these sequences? They already look short and of good quality. What's the best position for trimming?

Thanks in advance for your support
Riccardo

Keegan-Evans · October 26, 2022, 6:25pm

@rscotti,

Looking at Novogene's website, it looks like they do targeted sequencing on an Illumina NovaSeq 6000. The NovaSeq uses Illumina's RTA3 software to process the data and generate a simplified (binned) quality score representation. This produces a smaller number of (purportedly) more meaningful quality scores, which allows faster processing in downstream steps with similar accuracy. Illumina has an Application Note that details how this quality score is calculated.
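If you want to confirm that your reads carry binned scores, one quick check is to count how many distinct Phred values appear in a FASTQ file. The sketch below is only a rough illustration: it assumes Phred+33 encoding and standard four-line FASTQ records, and the function names are made up for this example. Binned data should show only a handful of distinct values, whereas unbinned MiSeq-style data usually shows dozens.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def quality_score_counts(path, max_reads=10000):
    """Count Phred scores over the first max_reads records.

    ASSUMPTIONS: Phred+33 encoding and unwrapped four-line FASTQ records.
    """
    counts = Counter()
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number // 4 >= max_reads:
                break
            if line_number % 4 == 3:  # every fourth line is the quality string
                counts.update(ord(char) - 33 for char in line.rstrip("\r\n"))
    return counts
```

Calling `sorted(quality_score_counts("some_sample_1.fq.gz"))` (the file name is hypothetical) lists the distinct scores present in the sampled reads.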

You are not seeing as dramatic a drop-off in quality at the ends because you are looking at an average for each position, which will hide individual lower scores that you might see in reads from systems like the Illumina MiSeq. The general consensus seems to be that you should just run a more or less normal workflow, with the understanding that things might get weird. See this discussion on the DADA2 GitHub page about concerns related to denoising binned data. The code examples there are in R, but QIIME 2 runs DADA2 in the background to perform denoising with q2-dada2, so the discussion is relevant.

I would plan on truncating where there is a distinct "block" of lower scores. The denoising can accommodate a few lower scores, but if you have a lot, the amount of data kept from your reads will start to drop off fairly quickly. In your case, for both the forward and reverse reads, it looks like you could probably get away with truncating at around position 175 and not worry about any trimming.
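As a rough way to locate such a block numerically, the sketch below reports the first position where the median Phred score across reads falls below a threshold. This is only a heuristic for illustration, not what QIIME 2 or DADA2 does internally. The function name, the default threshold of 25, and the Phred+33 assumption are all mine.

```python
from statistics import median


def suggest_truncation(quality_strings, min_median_q=25):
    """Return how many leading positions have a median Phred score >= min_median_q.

    ASSUMPTIONS: Phred+33 encoded quality strings; the threshold is arbitrary.
    """
    if not quality_strings:
        raise ValueError("need at least one quality string")
    longest = max(len(q) for q in quality_strings)
    for position in range(longest):
        # Only reads that are long enough contribute to this position.
        scores = [ord(q[position]) - 33 for q in quality_strings if len(q) > position]
        if median(scores) < min_median_q:
            return position  # keep positions 0 .. position-1
    return longest
```

With binned scores the median tends to jump between bins, so treat the result as a starting point and confirm it against the interactive quality plot.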

rscotti · October 29, 2022, 3:13pm

Ok, thanks for your help!
I'll try a couple of settings (truncating at around 175 bp and around 200 bp) to compare the results.
Since they used a NovaSeq platform, the reads are already shorter than usual (MiSeq), and I don't want to make them too short...
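One way to judge whether a truncation setting is too short is to work out how much overlap is left for merging the paired reads. Here is a small sketch of that arithmetic. The amplicon lengths in the usage note are hypothetical, and the default of 12 bases reflects DADA2's documented default minimum overlap for merging, which is worth checking against the q2-dada2 version in use.

```python
def merged_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Expected overlap (in bases) between truncated forward and reverse reads.

    A negative value means the reads no longer reach each other.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=12):
    """True if the expected overlap meets the assumed minimum (default 12 bases)."""
    return merged_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap
```

For a hypothetical 253 bp amplicon, truncating both reads at 175 leaves 97 bases of overlap, which is plenty. For a hypothetical 420 bp amplicon, even truncating at 200 gives an overlap of -20, so the pairs could not merge. Real amplicons vary in length, so leave some margin above the minimum.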
