Data: ITS sequencing for fungal community analysis, Illumina 1.9
The QIIME 2 version is q2cli 2022.11.1, installed with Miniconda.
I have encountered an issue after importing the data and denoising it with DADA2. The percentage of non-chimeric input is fairly low, so I was looking into ways of increasing it. While reading the forum I found out about the --p-min-fold-parent-over-abundance parameter and increased it to 4, but this only slightly increased the percentage of non-chimeric input. I understood that a good value would be above 80%, but my highest reaches 67%. (I attach the denoise_stats visualization; based on the quality plots, I chose the denoising parameters --p-trunc-len-f 250 --p-trim-left-f 0 --p-trunc-len-r 222 --p-trim-left-r 11.)
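Since the "percentage of input non-chimeric" value is relative to the raw input, reads lost at filtering or merging also lower it. A minimal sketch of how the count columns of the denoising stats (input, filtered, denoised, merged, non-chimeric) could be turned into per-step retention, to see which step is actually losing reads. The counts below are hypothetical and are not taken from the attached visualization.

```python
def stepwise_retention(counts):
    """Fraction of reads kept at each DADA2 step, relative to the previous step."""
    steps = ["input", "filtered", "denoised", "merged", "non-chimeric"]
    retention = {}
    for prev, cur in zip(steps, steps[1:]):
        retention[cur] = counts[cur] / counts[prev] if counts[prev] else 0.0
    return retention


# Hypothetical counts for a single sample (illustration only).
example = {
    "input": 10000,
    "filtered": 9000,
    "denoised": 8800,
    "merged": 8000,
    "non-chimeric": 6700,
}

for step, frac in stepwise_retention(example).items():
    print(f"{step:>12}: {frac:.1%} of previous step")

total = example["non-chimeric"] / example["input"]
print(f"non-chimeric as a share of input: {total:.1%}")
```

In this made-up sample, 67% of the input ends up non-chimeric, but only part of that loss happens at the chimera step; the rest is spread across filtering and merging.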
I know a possible cause might be the presence of primers or adapters in the sequences; however, I only received the data, with no such information. I looked into the q2-cutadapt plugin, but I saw that it can only search for specified adapter sequences, which I don’t know. Therefore, I tried using the Trim Galore plugin, as it can automatically search for the most common adapter sequences. But the output I get is:
While trying to identify the issue, I made FastQC reports for my data, and I came across something strange. I am worried about the “GC content per sequence” plot, as it doesn’t follow the normal distribution and has many sharp spikes. From what I’ve seen, this corresponds to overrepresented sequences, and I am not sure whether it indicates the presence of adapters or primers, or whether it’s due to the nature of my data (I’ve read that this might be common for RNA).
I would deeply appreciate your help. If there’s anything else you need, please let me know. Also, I apologize if there are any blatant mistakes or obvious answers. This is my first time working with QIIME 2 and with sequencing altogether.
I have tried everything I could think of, and I have now started going through the Fundamentals of Bioinformatics classes on the QIIME 2 YouTube channel, hoping I could solve the issue by getting a deeper understanding of the basics. However, I haven't been able to make any progress by myself, so if anyone would be able to help, it would be very much appreciated!
You are right that getting a deeper understanding of the basics is a great way to understand what is going on with any problem, but I will caution you, speaking from personal experience, that there is a lot to learn before it feels helpful. You might also find that watching the videos from our workshops is a more expedient route, as they are more directly related to the underpinnings and process of analysis; here is a link to our most recent one. It uses the Galaxy interface, but the tutorial itself should let you display the commands for any of our interfaces using the dropdown at the top.
Also, 67% passing as non-chimeric is not bad; the real concern is when something like 70-80% of your reads are chimeric. Looking through the files you posted, I think your data is worth working with as-is. In the future, to avoid having to throw out lots of chimeras, keeping the number of PCR cycles down and using good wet-lab technique are the things that will help. You certainly can play around with --p-min-fold-parent-over-abundance to see if it produces better results, though this is generally a last-ditch effort to salvage otherwise unusable data. Looking at your demux output, you might also try trimming the beginning of both reads at around position 16, and possibly truncating the reverse reads around position 220. Looking at the interactive quality plot, simply adding the trim step should leave some overlap for merging and might improve your denoising results a fair bit.
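To make the overlap reasoning concrete, here is a rough sketch of the arithmetic. It assumes DADA2's default minimum merge overlap of 12 nt, that both reads start at the ends of the amplicon, and that the amplicon is at least as long as each truncated read. The amplicon lengths are hypothetical, since ITS length varies between taxa. Trimming from the 5' end (--p-trim-left-f/-r) shortens the merged sequence but does not change the overlap, so only the truncation lengths appear here.

```python
MIN_OVERLAP = 12  # DADA2 default minimum overlap for merging (assumed unchanged)


def overlap(amplicon_len, trunc_len_f, trunc_len_r):
    """Bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


def max_mergeable_length(trunc_len_f, trunc_len_r, min_overlap=MIN_OVERLAP):
    """Longest amplicon that still leaves min_overlap bases of overlap."""
    return trunc_len_f + trunc_len_r - min_overlap


# Truncation values discussed above; amplicon lengths are illustrative only.
for amplicon_len in (300, 400, 450, 480):
    ov = overlap(amplicon_len, 250, 220)
    verdict = "can merge" if ov >= MIN_OVERLAP else "too little overlap"
    print(f"amplicon {amplicon_len} nt: overlap {ov} nt -> {verdict}")

print("longest mergeable amplicon:", max_mergeable_length(250, 220), "nt")
```

Under these assumptions, amplicons longer than trunc-len-f + trunc-len-r - 12 cannot merge, so every base removed by truncation also lowers the longest ITS variant that can be recovered.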
Regarding the removal of non-biological information, you can often learn more by contacting the sequencing center and asking which primers were used and what processing was performed before the data was returned to you. That said, this is often not possible, for example if you are working from old data that someone else had sequenced. If it turns out not to be possible, you can try running cutadapt (which is available as a QIIME 2 plugin!) with a list of the most common primers for your target region.
I wasn’t sure if continuing like this would be okay or not, so thank you for the guidance. However, I decided to try the other things you proposed and see if I could improve my reads’ quality.
Firstly I changed the trimming parameters to the ones you suggested:
I looked into the most common primers, and after manually searching for them in my forward and reverse fastq files, I decided to run cutadapt with these two primer pairs: ITS3 (GCATCGATGAAGAACGCAGC) / ITS4 (TCCTCCGCTTATTGATATGC) and ITS86F (GTGAATCATCGAATCTTTGAA) / ITS86R (TTCAAAGATTCGATGATTCAC). Both pairs seemed to appear multiple times in the fastq files.
But when looking at the quality plots, trimming still seemed to be necessary, so I ran DADA2 with the same parameters as before (except --p-trunc-len-r 220, which I changed to --p-trunc-len-r 217), but the results are coming out very strange.
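The manual search could be sketched in Python roughly as follows. This is an illustration only: it uses exact matching (no mismatches or degenerate bases, unlike cutadapt), the helper names are made up for this example, and the FASTQ path passed to `fastq_sequences` would be whatever file you are inspecting.

```python
import gzip

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Primer sequences exactly as quoted in the post above.
PRIMERS = {
    "ITS3": "GCATCGATGAAGAACGCAGC",
    "ITS4": "TCCTCCGCTTATTGATATGC",
    "ITS86F": "GTGAATCATCGAATCTTTGAA",
    "ITS86R": "TTCAAAGATTCGATGATTCAC",
}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def fastq_sequences(path):
    """Yield the sequence line of each record in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # line 2 of every 4-line record
                yield line.strip()


def primer_hits(reads, primers=PRIMERS):
    """Count reads that start with each primer, and reads containing it anywhere."""
    hits = {name: {"at_start": 0, "anywhere": 0} for name in primers}
    for read in reads:
        for name, seq in primers.items():
            if read.startswith(seq):
                hits[name]["at_start"] += 1
            if seq in read:
                hits[name]["anywhere"] += 1
    return hits


# Toy reads for illustration; real use would pass fastq_sequences(<your file>).
toy_reads = [
    PRIMERS["ITS3"] + "ACGTACGT",
    "NN" + PRIMERS["ITS4"] + "AAAA",
    "ACGTACGTACGT",
]
for name, counts in primer_hits(toy_reads).items():
    print(name, counts)

# Sanity check on primer orientation.
print(reverse_complement(PRIMERS["ITS86F"]) == PRIMERS["ITS86R"])
```

Separating "at start" from "anywhere" matters because a primer found only inside reads would cause a 5' trim to remove everything before it. The last line also shows that ITS86R, as written above, is the exact reverse complement of ITS86F, so the two sequences target the same site on opposite strands; it may be worth confirming which of them was actually used.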
Does this maybe have to do with there not being enough overlap between the reads after the trimming (since removing primers with cutadapt shortened the sequences, using the same dada2 trim parameters forces the sequences outside the overlap range)? If so, what would you consider good parameter values? Or is it a different issue altogether?
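One way to reason about this question is to track which part of each raw read survives both steps. The sketch below is a simplified model, not a diagnosis of the run above: it assumes cutadapt removes a primer only from the 5' end, that DADA2's trim-left and trunc-len are then applied to the primer-trimmed read, and that reads shorter than trunc-len are discarded. The raw read lengths (250 and 300) are hypothetical, since the actual length is not stated here.

```python
def kept_interval(read_len, primer_len, trim_left, trunc_len):
    """Return (start, end) of the kept bases in raw-read coordinates, or None.

    read_len:   raw read length before any processing
    primer_len: bases removed from the 5' end by cutadapt (0 if not run)
    trim_left:  DADA2 trim-left applied after primer removal
    trunc_len:  DADA2 trunc-len applied after primer removal (0 = no truncation)
    """
    remaining = read_len - primer_len
    if trunc_len > 0:
        if remaining < trunc_len:
            return None  # read is shorter than trunc_len, so it is discarded
        end = primer_len + trunc_len
    else:
        end = read_len
    return (primer_len + trim_left, end)


# Hypothetical scenarios: a 20 nt primer removed, then trim-left 16 / trunc-len 250.
scenarios = {
    "250 nt reads, no cutadapt": kept_interval(250, 0, 16, 250),
    "250 nt reads, after cutadapt": kept_interval(250, 20, 16, 250),
    "300 nt reads, after cutadapt": kept_interval(300, 20, 16, 250),
    "250 nt reads, cutadapt, no trunc": kept_interval(250, 20, 0, 0),
}
for label, interval in scenarios.items():
    print(f"{label}: {interval if interval else 'discarded'}")
```

In this model, reusing the same trunc-len after primer removal either discards reads that are now too short, or keeps bases further toward the 3' end than before. Comparing the filtered and merged columns of the denoising stats would show which of the two steps is losing reads in practice.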
If you need any more details please let me know. I will send them right away. Thank you very much for your time!
In the final case, it does look like the reads are not long enough to merge, as you are now getting extremely low merge percentages. I think I would at least try removing just those primers and seeing how a denoising run goes without any additional trimming. While DADA2 might end up dropping some reads due to excess noise, you might still end up with more reads passing through in total using this method.
I tried your suggestion and ran DADA2 denoising without trimming; however, this results in the same issue as before (extremely low merge percentage). I am attaching the denoise stats, in case you are interested.
No_trimming_cutadapt_dada2_denoise: xtrim_mort_dada2_stats.qzv (1.2 MB)
So I think I will continue the analysis with your previously suggested trimming parameters, and without cutadapt.