I am new to QIIME 2 and bioinformatics. After going through the tutorial, I began running my analysis.
However, something strange happened when I tried to visualize the sequence quality to determine the length to truncate.
My data consists of demultiplexed paired-end reads. I am targeting the V3-V4 region with the 341F/806R primers on an Illumina NovaSeq.
I first removed the barcodes and primers all together using a cutadapt command.
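For readers following along, the original command wasn't shown in the post; a sketch of what a `qiime cutadapt trim-paired` call for this primer set might look like is below. The file names are placeholders, and the 341F/806R sequences shown are commonly published ones, so check them against your own protocol.

```shell
# Hypothetical sketch: remove 341F/806R primers from demultiplexed
# paired-end reads. File names and primer sequences are assumptions,
# not taken from the original post.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GGACTACHVGGGTWTCTAAT \
  --p-error-rate 0.1 \
  --o-trimmed-sequences trimmed.qza \
  --verbose
```

The `--verbose` output reports how many reads matched each primer, which is a quick way to spot reads that were never trimmed.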
Another question: I am targeting the V3-V4 region, which is 465 bp long. Since my sequences include barcodes (6 bp each), the effective read length is 244 bp. So even if I don't truncate any further, the maximum overlap I can get is only 23 bp (244 × 2 − 465). Based on the quality plot I got here, should I truncate any further, or avoid truncating to make sure the reads merge?
Welcome to the forum, @Ming!
Cutadapt trim-paired searches your sequences for the adapter sequences you provide, and removes them if it finds them. If it doesn’t find a match, it can’t remove anything.
Notice that the quality in your plot seems to drop starkly right around the positions where you’ve trimmed. I’m not confident in this, but I wonder whether your cutadapt command trimmed your sequences successfully, and the sequences that remain at full length are the ones with poorer sequence quality, with corrupted barcode/primer sequences that didn’t match the cutadapt query.
You might be able to correct this by making the --p-error-rate more permissive. How do you plan to merge your sequences? If you’re planning to denoise and join your reads with DADA2, you may be able to skip cutadapt and trim those adapters off during denoising with --p-trim-left-f and --p-trim-left-r.
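If you go the DADA2 route, the trimming mentioned above might look roughly like this. The trim-left values are placeholders based on the 17 nt / 20 nt primer lengths, and the truncation values should come from your own quality plots, so treat this as a sketch rather than a recommendation:

```shell
# Hypothetical sketch: denoise untrimmed paired-end reads with DADA2,
# removing primers via trim-left instead of cutadapt. All parameter
# values are placeholders -- choose yours from your quality plots.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```

A `--p-trunc-len` of 0 disables truncation, which keeps maximum length for merging at the cost of retaining lower-quality tail positions.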
Re: merging, you’re cutting it pretty close there! Shouldn’t the length of the primers you removed with cutadapt also be deducted from the raw sequence length, so your actual biological sequences are 227 and 224 long? I’m not sure what your merging protocol requires, but DADA2 needs at least 12bp overlap (and a few extra bp never hurt).
NovaSeq uses binned quality scores, so how you think about trimming may be a little simpler. Ideally, keep only “high-quality” positions, where the median Q score is 37. If you don’t have enough length to merge sequences after trimming that way, then lower your standards or skip trimming, and see how many sequences come out of merging with adequate quality.
On further consideration, I’m not sure whether q2-DADA2 is currently set up to support binned quality scores like those from NovaSeq. On the off chance you were planning to go that route, you may want to proceed with caution. I’ll ask around and see what else I can learn.
OK, based on this issue, it looks like the DADA2 folks are working towards explicitly supporting NovaSeq, but do not yet. The issue includes some recommendations for a workaround you could use with the original DADA2 R package, as well as some sample code, but that won’t be supported in QIIME 2 until after it’s officially supported in a DADA2 release.
Side note: it looks like the DADA2 developers are still looking for NovaSeq data from known communities they can use to support the development effort. If you know of any that might be available, consider reaching out to them directly.
Thank you very much for your detailed reply.
There are some things I want to clarify.
For the first question, did you mean that less than 2% of the sequences weren't trimmed successfully (since the 98th-percentile results look OK), and that is why they still appear in the quality plot?
Based on your reply, my biological sequences are 227 and 224 bp, but my targeted region is 465 bp, so there is a 14 bp gap between the reads. In this case, how could my reads be merged?
I also tried merging first using FLASH2 and then using the pre-merged reads for denoising with Deblur. So far, the denoising step ran OK and I got a reasonable proportion of merged reads passing denoising.
If my reads cannot merge, I assume no reads would pass denoising. I don't know why I still got at least 6,000 sequences left.
That’s not how I’d put it (cutadapt correctly did what you asked it to, after all), but I think we’re on the same page here. Again, I’m not very experienced with cutadapt, but I think a small percentage of your sequences were not trimmed, because the adapter sequence quality didn’t meet cutadapt’s error rate.
The sequences that cutadapt couldn’t match are the only sequences long enough to show quality scores above position 224/227.
465 − 227 − 224 = 14
There’s some overlap there for you to work with, just not a whole lot. This can be an important consideration when you’re deciding which amplicons to work with, and how long your reads need to be when planning a study.
Please open separate questions for each unrelated question. It makes it easier for both forum mods and users searching for information.
I’m so sorry for the confusion, @Ming. I forgot to put my math on this morning.
You’re exactly right that 224 + 227 is less than 465, which doesn’t allow you any overlap. Joining reads isn’t going to be possible here, but your quality looks quite good, so you can probably still get a decent downstream analysis by considering only the forward reads.
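To make the corrected arithmetic concrete, here is a small sanity check using the lengths discussed in this thread (the 12 bp minimum is DADA2's merge requirement mentioned earlier; the variable names are just for illustration):

```shell
# Overlap check: reads can only merge if their combined length
# exceeds the amplicon length by at least the merger's minimum overlap.
AMPLICON=465     # V3-V4 target, bp
FWD=227          # usable forward read, bp
REV=224          # usable reverse read, bp
MIN_OVERLAP=12   # DADA2's minimum overlap for merging

OVERLAP=$((FWD + REV - AMPLICON))
if [ "$OVERLAP" -ge "$MIN_OVERLAP" ]; then
  echo "${OVERLAP} bp overlap: merging should work"
elif [ "$OVERLAP" -gt 0 ]; then
  echo "only ${OVERLAP} bp overlap: merging will be unreliable"
else
  echo "$(( -OVERLAP )) bp gap between reads: merging is impossible"
fi
```

Running this prints "14 bp gap between reads: merging is impossible", which matches the conclusion above.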
@Ming, I’m not personally familiar with Flash 2 and don’t have enough information about your situation to answer this question adequately. Based on the README and a quick skim of the original paper, Flash 2 has no way of knowing whether your sequences do or do not overlap. All it can do is compare the mismatch ratios of forward/reverse pairs at every possible overlap and select the best one whose mismatch density stays below the allowed maximum. For this reason, Flash 2 is not intended for use with read pairs that do not overlap.
If this is something you want to explore in greater detail, you’re welcome to open a new topic in “Other Bioinformatics Tools”. Maybe with more information or experience, someone from the community can help.
@ChrisKeefe did a great job explaining things in this post, but I wanted to point out a few additional items that didn't get covered, which might help in the future:
The "Demultiplexed sequence length summary" doesn't actually tell you the min and max lengths observed; it's telling you the distribution of lengths observed within the subsampled visualization (10000 sequences in the screenshot shared). I think what you can conclude from those tables is that 98% of the subsampled forward reads are at most 227 nts long, and 98% of the subsampled reverse reads are at most 224 nts long. That means up to 2% of the subsampled reads can be longer than that, but we don't know by how much, at least from that table. As @ChrisKeefe mentioned, this might be due to cutadapt not trimming all of the adapters off of your reads.
If you want to see how many sequences in the subsampled data are 250 nts long (or whatever the max length on the interactive plot is), hover your mouse over the boxplot at the last position on the x-axis of the plot; the text below the plot should update to tell you how many sequences were observed at that length.
If you want to disable subsampling, re-run the command with the parameter --p-n 5250092, which tells this visualization to run using all of the sequences in the dataset (I got that number from your screenshot above; with a different dataset, this number will be different). This provides a comprehensive overview of the data, but it comes at a computational cost: be prepared to wait a little longer for results.
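For reference, that full-data re-run might look like the sketch below. The artifact names are placeholders, and the --p-n value is the dataset size taken from the screenshot:

```shell
# Hypothetical sketch: summarize demultiplexed reads using every
# sequence rather than the default 10,000-read subsample.
qiime demux summarize \
  --i-data demux.qza \
  --p-n 5250092 \
  --o-visualization demux-full.qzv
```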