I have a question about quality checking and trimming in QIIME 2. I usually trimmed my sequences based on the primer sequence and a quality score below 20 using Trimmomatic. However, the way trimming works in QIIME 2 is that we should truncate at the position in the read where the quality drops. The problem is that my quality drops at position 236 and goes back up at 245. Is there any way I can trim at a specific quality instead of a position? Also, for trimming the primer, should I use --p-trim-left 19 if my primer is 19 nucleotides?
QIIME 2 has multiple different approaches to trimming available.
For manual trimming we usually recommend trimming where quality drops off below a threshold (my rule of thumb is where median Q drops < 20), not just arbitrarily wherever quality begins dropping.
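If it helps, the place I read that off is the interactive quality plot from demux summarize. A minimal sketch, assuming your demultiplexed reads are in an artifact called demux.qza (filenames here are placeholders):

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv

The resulting demux.qzv has per-position quality boxplots you can use to pick a truncation position.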
One reason we usually recommend trimming at a specific position is that otherwise identical sequences trimmed to different lengths will be considered different ASVs/OTUs, skewing diversity estimates. This is less of a concern with paired-end sequences, since reads are merged back into complete amplicons or dropped if they are too short.
You can do that in QIIME 2 as well! See the quality-filter plugin. Additionally, dada2 has a --p-trunc-q parameter to perform this type of trimming.
Yes, if your primers are still in the sequence (they aren't always, depending on the sequencing protocol you used).
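For the quality-filter route, a minimal sketch assuming demultiplexed single-end reads in demux.qza (filenames are placeholders):

# truncate reads where they hit a run of low-quality (< Q20) base calls,
# then drop reads that become too short
qiime quality-filter q-score \
  --i-demux demux.qza \
  --p-min-quality 20 \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats filter-stats.qza

The --p-trunc-q option on dada2 denoise-single does a similar quality-based truncation as part of denoising.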
Do you recommend that I use paired-end sequences if I have them? Our previous experience in the lab was that we usually got less diversity and worse reads when we used both sequences. My concern about cutting the sequences at a specific length is that I will lose a lot of reads. Also, if I trim based on a quality score of 20, the length would still vary between reads. How can I avoid that?
That's a matter of personal taste/goals. I have been in your same position: bad merges on paired-end reads are worse than just using single-end. But I would always give it a shot if you have the data.
You lose some information, but that may not necessarily impact classification, and it is better than skewing diversity estimates. It depends on your goals, though (e.g., if you only care about taxonomic classification, not diversity estimates, then it all comes out in the wash).
Yep, that's the point I was making above. There is no avoiding that if you are using single-end reads and want to trim dynamically. If you are using single-end reads, I would just recommend truncating at a fixed position.
Your reads look like they are high quality so you could probably get by without truncating! Let dada2 filter out the few sequences that do contain too many errors. (still trim out your primer(s))
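Something like this, as a sketch (single-end, filenames are placeholders; --p-trim-left 19 matches your 19-nt primer):

# no fixed truncation; dada2's own expected-error filter handles bad reads
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 19 \
  --p-trunc-len 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza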
Thank you so much for your helpful comments. So if I do not truncate and only remove the primers, is this how my command should look? And would that also filter out erroneous sequences, or should I add anything to the command for that? I am new to QIIME 2, so please forgive my ignorance.
Sure. I will try both and compare the data. If I use paired-end reads, then I guess I can trim based on a specific quality score and get around the read-length issue, right?
So the reason I was concerned about this is that I have a few negative control samples (water). Truncating at position 236 removed most of the reads for two of them, and I ended up with only 1 and 4 reads left. They originally had more than 200.
Correct, as long as trimming based on quality scores doesn't trim so much that the sequences can no longer overlap.
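Rough arithmetic (the numbers here are illustrative, not from your data): however you trim, the forward and reverse lengths after trimming need to cover the amplicon plus dada2's minimum overlap (12 nt by default). For a ~390 bp amplicon sequenced 2x250, the retained lengths should sum to roughly 402 or more, so 230 + 180 works while 200 + 180 does not. A hedged sketch of the paired-end call, with placeholder trim/truncation values:

# trim/trunc values below are placeholders; set them from your own primers and quality plots
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired.qza \
  --p-trim-left-f 19 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 230 \
  --p-trunc-len-r 180 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza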
This is caused by one of two things:
1. You have already trimmed the reads based on quality, so some of those sequences are shorter than 236 and get filtered out automatically.
2. Those negative control sequences are so riddled with errors that they get filtered out if you cut at a position that seems reasonable for the other samples.
I tried dada2 with --p-trunc-len 0 and --p-trim-left 19 and checked my read counts again. The issue persists with the negative controls. Even though I did not truncate, read counts are still dramatically low (4 for negative control 1 and 2 for negative control 2). I imagine this happened as part of the quality filtering that dada2 does. How would dada2 do this when I never asked for it with any command-line option?
dada2 always performs quality filtering; it is a denoising method, not a dereplication method.
Rather, this could be because you did not truncate. Without any truncation, you are including low-quality bases (I am guessing the negative controls have lower-quality reads overall, since you mention this shows up most strongly in the controls). dada2 is going to toss out any read that has more than 2 expected errors. This is the whole purpose of truncation prior to using dada2.
You should also always examine the stats file output by dada2 (use qiime metadata tabulate to view it as a QZV); this will tell you at which point of the dada2 pipeline reads are being dropped. Since you are using single-end reads, this is almost certainly the filtering step, but it could also be chimera filtering.
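For reference, viewing the stats is just (assuming the stats artifact is named denoising-stats.qza; the filename is a placeholder):

qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv

The resulting table lists, per sample, how many reads made it through filtering, denoising, and chimera removal.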
I think it is probably a good thing that your negative control reads are being filtered. They may be chimeras or other bad reads, and dada2 is doing its job. But you need to check out that stats file to find out more.
That statement is true for any method in any software: if you do not specify a parameter setting, the default is used.
So in this case --p-max-ee is the parameter that performs pre-denoising filtering. You can adjust that parameter if you think it is too stringent for your data, but that sort of runs counter to the purpose of using dada2 (and QC in general), unless, e.g., you have very long reads and think that 2 is not appropriate.
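If you do decide to relax it, a sketch (single-end, placeholder filenames; 5 is just an example value, not a recommendation):

# raise the maximum expected errors allowed per read (the default is 2)
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 19 \
  --p-trunc-len 0 \
  --p-max-ee 5 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza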