I have a question about quality checking and trimming in QIIME 2. I usually trimmed my sequences based on the primer sequence and a quality score below 20 using Trimmomatic. However, the way trimming works in QIIME 2 is that we should truncate at the position in the read where the quality drops. The problem is that my quality drops at position 236 and goes back up at 245. Is there any way I can trim at a specific quality instead of a position? Also, for trimming the primer, should I use --p-trim-left 19 if my primer is 19 nucleotides?
QIIME 2 has multiple different approaches to trimming available.
For manual trimming we usually recommend trimming where quality drops off below a threshold (my rule of thumb is where median Q drops < 20), not just arbitrarily wherever quality begins dropping.
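If it helps, the place I read that off is the interactive quality plot from demux summarize. A minimal sketch, assuming your demultiplexed reads are in an artifact called demux.qza (filenames here are placeholders):

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv

The resulting demux.qzv has per-position quality boxplots you can use to pick a truncation position.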
One reason we usually recommend trimming at a specific position is that otherwise identical sequences trimmed to different lengths will be considered different ASVs/OTUs, skewing diversity estimates. This is less of a concern with paired-end sequences, since reads are merged back into complete amplicons or dropped if they are too short.
You can do that in QIIME 2 as well! See the quality-filter plugin. Additionally, dada2 has a --p-trunc-q parameter to perform this type of trimming.
Yes, if your primers are still in the sequence (they aren't always, depending on the sequencing protocol you used).
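For the quality-filter route, a minimal sketch assuming demultiplexed single-end reads in demux.qza (filenames are placeholders):

# truncate reads where they hit a run of low-quality (< Q20) base calls,
# then drop reads that become too short
qiime quality-filter q-score \
  --i-demux demux.qza \
  --p-min-quality 20 \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats filter-stats.qza

The --p-trunc-q option on dada2 denoise-single does a similar quality-based truncation as part of denoising.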
Do you recommend that I use paired-end sequences if I have them? Our previous experience in the lab was that we usually got less diversity and worse reads when we used both sequences. My concern about cutting the sequences at a specific length is that I will lose a lot of reads. Also, if I trim based on a quality score of 20, the length would still vary between reads. How can I avoid that?
That's a matter of personal taste/goals. I have been in your same position: bad merges on paired-end reads are worse than just using single-end. But I would always give it a shot if you have the data.
You lose some information, but that may not necessarily impact classification, and it is better than skewing diversity estimates. It depends on your goals, though (e.g., if you only care about taxonomic classification, not diversity estimates, then it all comes out in the wash).
Yep, that's the point I was making above. There is no avoiding that if you are using single-end reads and want to trim dynamically. If you are using single-end reads, I would just recommend truncating at a fixed position.
Your reads look like they are high quality so you could probably get by without truncating! Let dada2 filter out the few sequences that do contain too many errors. (still trim out your primer(s))
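Something like this, as a sketch (single-end, filenames are placeholders; --p-trim-left 19 matches your 19-nt primer):

# no fixed truncation; dada2's own expected-error filter handles bad reads
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 19 \
  --p-trunc-len 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza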
Thank you so much for your helpful comments. So if I do not truncate and only remove the primers, is this how my command should look? And would that also filter out erroneous sequences, or should I add anything to the command for that? I am new to QIIME 2, so please forgive my ignorance.
Sure. I will try both and compare the data. If I use paired-end reads, then I guess I can trim based on a specific quality score and get around the read-length issue, right?
So the reason I was concerned about this is that I have a few negative control samples (water). Truncating at position 236 removed most of the reads for two of them, and I ended up with only 1 and 4 reads left. They originally had more than 200.
Correct, as long as trimming based on quality scores doesn't trim so much that the sequences can no longer overlap.
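Rough arithmetic (the numbers here are illustrative, not from your data): however you trim, the forward and reverse lengths after trimming need to cover the amplicon plus dada2's minimum overlap (12 nt by default). For a ~390 bp amplicon sequenced 2x250, the retained lengths should sum to roughly 402 or more, so 230 + 180 works while 200 + 180 does not. A hedged sketch of the paired-end call, with placeholder trim/truncation values:

# trim/trunc values below are placeholders; set them from your own primers and quality plots
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired.qza \
  --p-trim-left-f 19 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 230 \
  --p-trunc-len-r 180 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza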
This is caused by one of two things:
1. You have already trimmed the reads based on quality, so some of those sequences are shorter than 236 and get filtered out automatically.
2. Those negative control sequences are so riddled with errors that they get filtered out if you cut at a position that seems reasonable for the other samples.
I tried dada2 with --p-trunc-len 0 and --p-trim-left 19 and checked my read counts again. The issue persists with the negative controls. Even though I did not truncate, read counts are still dramatically low (4 for negative control 1 and 2 for negative control 2). I imagine this happened as part of the quality filtering that dada2 does. How would dada2 do this when I never asked for it with any command-line option?
dada2 always performs quality filtering; it is a denoising method, not a dereplication method.
Rather, this could be because you did not truncate. Without any truncation, you are including low-quality bases (I am guessing the negative controls have lower-quality reads overall, since you mention this shows up most strongly in the controls). dada2 is going to toss out any read that has more than 2 expected errors. This is the whole purpose of truncation prior to using dada2.
You should also always examine the stats file output by dada2 (use qiime metadata tabulate to view it as a QZV); this will tell you at which point of the dada2 pipeline reads are being dropped. Since you are using single-end reads, this is almost certainly the filtering step, but it could also be chimera filtering.
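For reference, viewing the stats is just (assuming the stats artifact is named denoising-stats.qza; the filename is a placeholder):

qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv

The resulting table lists, per sample, how many reads made it through filtering, denoising, and chimera removal.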
I think it is probably a good thing that your negative control reads are being filtered. They may be chimeras or other bad reads, and dada2 is doing its job. But you need to check out that stats file to find out more.
That statement is true for any method in any software: if you do not specify a parameter setting, the default is used.
So in this case --p-max-ee is the parameter that performs pre-denoising filtering. You can adjust that parameter if you think it is too stringent for your data, but that sort of runs counter to the purpose of using dada2 (and QC in general), unless, e.g., you have very long reads and think that 2 is not appropriate.
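If you do decide to relax it, a sketch (single-end, placeholder filenames; 5 is just an example value, not a recommendation):

# raise the maximum expected errors allowed per read (the default is 2)
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 19 \
  --p-trunc-len 0 \
  --p-max-ee 5 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza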