Questions about 16S data from Novogene (UK)

KQUB · July 21, 2023, 6:07am

Good advice, it seems. I also found some other forum posts where the solution was to run two separate cutadapt commands: here, here and here. I will probably do the same, then.

I'm not sure I follow. I have been talking about removing the V3–V4 primers (a.k.a. the amplicon primers, the PCR primers) from near (but not at) the 5' ends of my reads. As far as I understand, these primers are not parts of the adapters. They should instead be part of the DNA insert shown below. The sequencing primers (or sequencing primer binding sites; i.e. Rd1 SP and Rd2 SP) are parts of the adapters, but the V3–V4 primers aren't, right? My understanding is that it therefore doesn't matter whether the adapters or V3–V4 primers are trimmed first.

Is that true? What makes you say so? I've read that quality plots for NovaSeq data tend to look different to, for example, quality plots for MiSeq data because of the binning of quality scores (a.k.a. Q-scores): other examples here, here, here, and here. In other words, it's my understanding that the Q-score reported for a given base is a representative value for a whole bin of scores rather than a score specific to that base. Therefore, the "actual" Q-score for a given base could be lower than the reported one. I'm no expert on this, though.

Whatever the case, I guess we can only work with the Q-scores we have. Here are some quotes from an Illumina PDF about NovaSeq™ 6000 System Quality Scores:

"A Q-score of 30 (Q30) corresponds to a 0.1 percent error rate in base calling, and is widely considered a benchmark for high-quality data."

"The three groups in the quality table correspond to marginal (<Q15), medium (~Q20), and high-quality (>Q30) base calls, and are assigned the specific scores of 12, 23, and 37 respectively.* Additionally, a null score of 2 is assigned to any no-calls."
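For reference, the 0.1 percent figure follows from the standard Phred definition, Q = -10 * log10(P), where P is the probability of a wrong base call. Below is a minimal Python sketch that converts the binned scores mentioned in this post into error probabilities. The function name is my own, and the list of scores is only illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Binned scores from the Illumina quote (2, 12, 23, 37), the ones seen in
# our data (2, 11, 25, 37), and Q30 as the usual benchmark.
for q in sorted({2, 11, 12, 23, 25, 30, 37}):
    print(f"Q{q}: error probability ~ {phred_to_error_prob(q):.4f}")
```

Because each bin is reported as a single value, these probabilities describe the bin's assigned score, not a per-base estimate.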

The Q-scores for our data (i.e. 2, 11, 25, 37) are slightly different than those described above (i.e. 2, 12, 23, 37), but the idea seems the same. Would it not make sense to trim all bases with Q-scores < 25 from the 3' ends of our reads, so that we are left with only medium and high-quality base calls at the 3' end?
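To make the idea concrete, here is a simplified Python sketch of running-sum 3' quality trimming, modelled on the algorithm described in the cutadapt documentation (subtract the cutoff from every quality, sum from the 3' end, and cut where the sum is lowest). This is an assumption-laden illustration, not cutadapt's actual code, and the function name and example read are made up.

```python
def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the 3' end (keep qualities[:index]).

    Simplified running-sum approach: subtract the cutoff from each quality,
    accumulate from the 3' end, and cut where the accumulated sum is lowest.
    """
    running_sum = 0
    min_sum = 0
    cut_index = len(qualities)  # default: trim nothing
    for i in reversed(range(len(qualities))):
        running_sum += qualities[i] - cutoff
        if running_sum < min_sum:
            min_sum = running_sum
            cut_index = i
    return cut_index

# Hypothetical read using the binned scores from our data (2, 11, 25, 37).
read_quals = [37, 37, 37, 25, 37, 11, 25, 11, 11, 2]
keep = quality_trim_index(read_quals, cutoff=25)
print(read_quals[:keep])  # bases kept after trimming
```

With a cutoff of 25, this sketch keeps the first five bases of the example read. Note that an isolated Q25 base sitting among Q11 bases near the 3' end is trimmed along with them, because the decision depends on the running sum rather than on each base individually.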

Or are my quality-filtering ideas a waste of time? Because...

I know I could use the truncation function of qiime dada2 denoise-paired to trim bases from the 3' end of reads instead of doing any kind of quality-based filtering, but it seems like it could be trickier to choose the truncation values with NovaSeq data than with, for example, MiSeq data, because the quality plots from demux.qzv don't seem to give as clear an indication of where a good position to trim would be. Or what do you think?

Thanks as always for your time, @colinvwood!