DADA2 filtering out ~80% of reads

Hi all,

I ran DADA2 on an Illumina Miseq (2x250 bp) data set with 515F-806R primers (for the 16S rRNA gene).
The quality scores for reads look like this

So, based on this I ran the following command for DADA2

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 155 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

The stats-dada2.qza file shows that I am losing close to 80% of sequence reads.

I tried multiple different --p-trunc-len-r values, but ~155 gave the best result in terms of the number of reads left after DADA2 filtering. I was also careful not to truncate so much that the overlapping region was reduced. I also checked for non-biological sequences in the fastq files, and there are none. I tried Deblur as well, but a large proportion of reads is filtered out there too. Thanks to the previous posts on the forum :pray:t5:

I have another very similar data set (same PCR conditions, primers, sequencing platform, just a different cohort of mice). In terms of quality scores, this run looked very similar

I used the same parameters for the DADA2 step as above, but this data set gave me the following stats.dada2 output, showing at least 84% of reads are retained.

I am running out of ideas for troubleshooting. I cannot work out why two data sets with very similar read qualities are giving me vastly different results, or how to reduce the number of reads being filtered out without compromising the quality of my outputs.

If anyone has any suggestions I will be very thankful.

Thank you in advance.



I think you should try decreasing the --p-trunc-len-f parameter (and increase --p-trunc-len-r if necessary to keep the overlapping region). This should keep more reads at the filtering step, since all reads shorter than the truncation length are discarded.
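To sanity-check truncation choices, the overlap arithmetic can be sketched like this (a rough sketch with assumed numbers: the 515F-806R amplicon is roughly 253 bp after primer removal, though the exact length varies by taxon, and DADA2 needs on the order of 12 bp of overlap to merge; the function name is mine):

```python
# Sketch: how many bases of overlap remain for DADA2 to merge read pairs
# after truncating the forward and reverse reads.
def merge_overlap(trunc_f, trunc_r, amplicon_len=253):
    """Overlap (bp) left after truncation, assuming a fixed amplicon length."""
    return trunc_f + trunc_r - amplicon_len

print(merge_overlap(250, 155))  # 152 bp: plenty of overlap
print(merge_overlap(190, 160))  # 97 bp: still comfortably above ~12 bp
```

Since both settings leave ample overlap, failed merging alone is unlikely to explain the read loss here.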


Thanks for the quick reply @timanix

I tried what you suggested on a subset of the data (with the previous parameters, all the samples in this subset had around 75-80% sequence loss).

When I tried

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 190 \
  --p-trunc-len-r 160 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

There was some improvement (~5-10% more reads per sample), but I am not sure whether that is significant.

If there are any other suggestions, please let me know.

Thank you.

Here are some more options:

  1. If primers are still in the sequences, you need to cut them with cutadapt.
  2. It is possible that bad quality scores in the middle of the reverse reads are causing such severe filtering. If so, you can try to:
  • play with the --p-max-ee-r parameter
  • merge reads with vsearch join-pairs and then denoise with Deblur
  • use vsearch for merging, filtering and chimera removal.
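To make the maxEE idea concrete, here is a small sketch (not DADA2's actual code; the quality profiles are invented): a read's expected errors are the sum of its per-base error probabilities, and the read is discarded when that sum exceeds the threshold (--p-max-ee-r, default 2 in QIIME 2).

```python
# Sketch of DADA2's expected-errors (maxEE) criterion. A base with Phred
# score Q has error probability 10^(-Q/10); the read is discarded when the
# sum over all bases exceeds the maxEE threshold.
def expected_errors(quals):
    return sum(10 ** (-q / 10) for q in quals)

good = [38] * 150               # uniformly high quality
dippy = [38] * 100 + [8] * 50   # mid-read quality crash
print(round(expected_errors(good), 2))   # 0.02 -> passes maxEE = 2
print(round(expected_errors(dippy), 2))  # 7.94 -> discarded at maxEE = 2
```

This is why a quality dip in the middle of the reverse reads can discard a read even when most of its bases are excellent.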

Hi @Hasinika, I got a hint (thanks @llenzi :wave:) that since your expected amplicon size is 300, you can just use your forward reads to proceed with the analysis. That way, the bad quality scores of the reverse reads will not affect the denoising step.


Thank you @timanix (and @llenzi) for your suggestions! I really appreciate your help here. I will try with just the forward reads and hope it fixes the issue.


Hi @timanix thanks again for your suggestions. I tried a few;

  1. Only using the forward reads: this did not improve the number of reads that passed the quality filtering step (which leaves me confused as to what is going on).

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-single-end.qza \
  --p-trim-left 0 \
  --p-trunc-len 245 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

  2. I have already tried cutadapt on both the forward and reverse reads, just in case there were non-biological sequences; it does not look like I have any, as the output files with and without this step show no difference.
    The primers I used are the original 515F-806R according to the

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end.qza \
  --o-trimmed-sequences demux-paired-end-filtered_new.qza
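As an independent check that primers really are absent, a few raw read sequences can be scanned directly (a sketch: the IUPAC expansion and the example reads are mine; the primer sequence is the published 515F, GTGYCAGCMGCCGCGGTAA):

```python
# Sketch: look for the 515F primer at the start of forward reads by
# expanding its IUPAC degenerate codes into a regular expression.
import re

IUPAC = {"Y": "[CT]", "M": "[AC]", "W": "[AT]", "N": "[ACGT]", "V": "[ACG]"}
primer = "GTGYCAGCMGCCGCGGTAA"
pattern = re.compile("".join(IUPAC.get(base, base) for base in primer))

reads = [
    "GTGTCAGCAGCCGCGGTAATACGGAG",  # hypothetical read, primer still attached
    "TACGGAGGATCCGAGCGTTATCCGGA",  # hypothetical read, primer already removed
]
print(sum(1 for r in reads if pattern.match(r)))  # 1
```

If the count is near zero on real forward reads, the primers were indeed stripped before or during demultiplexing.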

  3. I then tried vsearch (for merging and quality filtering) and then Deblur; the commands are below. This actually retained a few more sequences than any of the other methods I have tried.

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --o-joined-sequences demux-joined.qza

demux-joined.qzv looks like this

qiime quality-filter q-score \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza

demux-joined-filter-stats.qzv file looks like this

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 250 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza

deblur-stats.qzv looks like this

It looks like ~60-70% of reads are still being filtered out; the "reads hit-reference" values in the file above (deblur-stats.qzv) are not very different from the DADA2 output for the same samples (shown in the third post of this thread). So DADA2, Deblur, and vsearch + Deblur are all filtering out a lot of reads, despite trimming away low-quality base calls, and the quality score plots for this data set (especially for the forward reads) are not too bad. If you have any guesses as to why this may be, I would be thankful.

Also, the fact that only using the forward read did not improve the outputs puzzles me, do you think I have missed anything here?

Thanks again, any help with these will be really appreciated. Please let me know if I can provide any more information to better understand the issue.



Hi all,

Thank you so much for helping me with this. I have the following update since my last post.

Looking at the adapter content graphs from FastQC (provided by the sequencing service), there are differences between this problematic run and the other, very similar run that I mentioned in the first post. Both runs were Illumina MiSeq (2x250 bp).
The problematic run, which filters out many reads at the filtering steps of DADA2/Deblur/vsearch:

The other run (same library prep and sequencing conditions) has a different graph for adapter content, which according to our sequencing service provider is the usual pattern:

Both libraries were prepared using the following Illumina adapter, index, primer pad, primer link, and primer sequences
Adapters are in bold, unique indices (barcodes) are shown with XXXXXXX



I am not sure whether this has anything to do with the high sequence loss I am experiencing.
Any help in figuring out this problem will be really appreciated.

Thank you so much!


Thank you for providing such detailed information and for trying all the options before posting.
Several things to try...
Since your reads do not become significantly longer after merging with VSEARCH, you can just reimport only your forward reads as single-end reads, eliminating the merging step and the associated loss of reads.

After it, you can try to cut primers with cutadapt:

qiime cutadapt trim-single \
    --i-demultiplexed-sequences demux-single.qza \
    --o-trimmed-sequences trim-single.qza \
    --p-match-adapter-wildcards \
    --p-discard-untrimmed

Now, since --p-discard-untrimmed is enabled, you can see whether primers were cut from the reads (if the output is much smaller than the input, either the wrong primer was provided or the primers were already removed).

Could you try running it with --p-trunc-len 240 (if primers were still in the sequences and you removed them, check the quality plots and the lengths of the sequences, and set it even lower), or with 0 (disabling truncation completely)? Also, you can add --p-max-ee and set it to 5 or 10 to relax the filtering parameters.

If DADA2 still does not work for you, you can repeat Deblur (now with only the forward reads). This time, try a lower --p-trim-length based on the length of the reads after cutadapt.

Neither do I, so if other moderators/members have input about it, or about the issue overall, please join us in the comments :hugs:


Thank you so much @timanix for taking the time to help me with this; I appreciate your input so much :pray:t5:

I will retry with just the forward reads. I gave up on them when I saw no improvement, but you have a few great suggestions that I have not tested yet. Hope to get back soon :crossed_fingers:t5:

Hi @timanix and everyone, I found out that there are Ns in both the forward and reverse reads of this problematic Illumina MiSeq run, and that seems to be why I have been losing a lot of reads at the filtering steps. Relaxing the "qiime quality-filter q-score" parameters (--p-max-ambiguous to 1) retained a large proportion of the reads, and Deblur seems to process most of them without an issue. I understand that this is not ideal, but would anyone know whether there are similar parameters in DADA2 on QIIME 2 (perhaps in the "qiime dada2 denoise-paired" step) to tweak this, similar to the "maxN" option in the R-based DADA2?
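For anyone hitting the same wall, the fraction of N-containing reads can be checked directly before denoising (a sketch with a toy in-memory FASTQ; as far as I can tell, DADA2's R maxN default is 0 and q2-dada2 does not expose it, so such reads are dropped outright):

```python
# Sketch: count reads whose sequence line contains an N. Reads with any N
# are discarded by DADA2's filter (maxN = 0), which can explain heavy loss.
def n_containing(fastq_lines):
    seqs = fastq_lines[1::4]  # sequence is every 4th line, starting at the 2nd
    return sum("N" in s for s in seqs), len(seqs)

toy_fastq = [
    "@read1", "ACGTNACGT", "+", "IIIIIIIII",
    "@read2", "ACGTACGTA", "+", "IIIIIIIII",
]
print(n_containing(toy_fastq))  # (1, 2): one of two reads contains an N
```

Running a check like this on the raw fastq files would show whether the N content alone accounts for the ~80% loss.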

Thank you for helping.


Hi @Hasinika
Sorry for the long silence; your case is more complicated than I thought at first. I got a hint regarding your topic, so I am going to post it here. @Mehrbod_Estaki, thank you, and I hope you don't mind if I cite your message in my response.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.