DADA2 filtering out ~80% of reads

Hi all,

I ran DADA2 on an Illumina MiSeq (2x250 bp) data set generated with the 515F-806R primers (targeting the 16S rRNA gene).
The quality scores for the reads look like this:

So, based on this, I ran the following command for DADA2:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 155 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

The stats-dada2.qza file shows that I am losing close to 80% of sequence reads:

I tried multiple different --p-trunc-len-r values, but ~155 gave the best result in terms of the number of sequence reads left after DADA2 filtering. I was also careful not to truncate so much that the overlapping region would be lost. I checked for non-biological sequences in the fastq files, and there are none. I also tried Deblur, and a large proportion of reads is being filtered out there too. Thanks to the previous posts on the forum :pray:t5:
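(For reference on the overlap arithmetic: with --p-trunc-len-f 250 and --p-trunc-len-r 155, the retained bases total 250 + 155 = 405 nt; against an expected amplicon of ~300 nt, that leaves roughly 100 nt of overlap, comfortably above the ~12 nt minimum that DADA2 needs to merge pairs.)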

I have another very similar data set (same PCR conditions, primers, and sequencing platform, just a different cohort of mice). In terms of quality scores, this run looked very similar:

I used the same parameters for the DADA2 step as above, but this data set gave me the following stats-dada2 output, showing that at least 84% of reads are retained.

I am running out of ideas for troubleshooting. I cannot work out why two data sets with very similar sequence read qualities are giving me vastly different results, or how to reduce the number of reads being filtered out without compromising the quality of my outputs.

If anyone has any suggestions, I will be very thankful.

Thank you in advance.

Cheers
Hasinika

Hello!

I think you should try decreasing the --p-trunc-len-f parameter (and increasing --p-trunc-len-r if necessary to keep the overlapping region). This should keep more reads at the filtering step, since all reads shorter than the truncation length are discarded.


Thanks for the quick reply @timanix

I tried what you suggested on a subset of the data (with the previous parameters, all the samples in this subset had around 75-80% sequence loss).

When I tried

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 190 \
  --p-trunc-len-r 160 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

There was some improvement (~5-10% more reads per sample), but I am not sure whether it is a significant one.

If there are any other suggestions, please let me know.

Thank you.

Here are some more options:

  1. If primers are still in the sequence, you need to cut them with cutadapt.
  2. It is possible that bad quality scores in the middle part of the reverse reads are causing such severe filtering. If so, you can try to:
  • play with the --p-max-ee-r parameter (see the sketch after this list)
  • merge the reads with qiime vsearch join-pairs and then denoise them with Deblur
  • use vsearch for merging, filtering, and chimera removal.
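For the first bullet, a minimal sketch of a relaxed run might look like this (the --p-max-ee-r value of 5 and the output file names are placeholders to adjust; the default value is 2):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 155 \
  --p-max-ee-r 5 \
  --o-representative-sequences rep-seqs-relaxed.qza \
  --o-table table-relaxed.qza \
  --o-denoising-stats stats-relaxed.qza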

Hi @Hasinika, I got a hint (thanks @llenzi :wave:) that since your expected amplicon size is 300, you can just use your forward reads to proceed with the analysis. That way, the bad quality scores of the reverse reads will not affect the denoising step.


Thank you @timanix (and @llenzi) for your suggestions! I really appreciate your help here. I will try with just the forward reads and hope it fixes the issue.


Hi @timanix, thanks again for your suggestions. I tried a few:

  1. Only using forward reads: this did not improve the number of reads that passed the quality-filtering step (this leaves me confused as to what is going on).

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-single-end.qza \
  --p-trim-left 0 \
  --p-trunc-len 245 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

  2. I have already tried cutadapt on both the forward and reverse reads just in case there are non-biological sequences; it does not look like I have any, as the output files with and without this step show no difference.
    The primers I used are the original 515F-806R, according to the 16S Illumina Amplicon Protocol: earthmicrobiome
    The 515F primer has AATGATACGGCGACCACCGAGATCTACAC as the 5' adapter, and 806R has CAAGCAGAAGACGGCATACGAGAT.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end.qza \
  --o-trimmed-sequences demux-paired-end-filtered_new.qza \
  --p-adapter-f GTGTAGATCTCGGTGGTCGCCGTATCATT \
  --p-front-f AATGATACGGCGACCACCGAGATCTACAC \
  --p-adapter-r CAAGCAGAAGACGGCATACGAGAT \
  --p-front-r ATCTCGTATGCCGTCTTCTGCTTG

  3. I then tried vsearch (for merging and quality filtering) and then Deblur; the commands are below. This actually retained a few more sequences than any of the other methods I have tried.

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --o-joined-sequences demux-joined.qza

demux-joined.qzv looks like this:

qiime quality-filter q-score \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza

The demux-joined-filter-stats.qzv file looks like this:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 250 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza

deblur-stats.qzv looks like this:

It looks like ~60-70% of reads are still being filtered out; the "reads hit-reference" values in the file above (deblur-stats.qzv) are not very different from the DADA2 output for the same samples (shown in the third post of this thread). So the DADA2, Deblur, and vsearch + Deblur options are all filtering out a lot of reads, despite trimming away low-quality base calls, and even though the quality score plots for this data set (especially the forward reads) are not too bad. If you have any guesses as to why this may be, I would be thankful.

Also, the fact that using only the forward reads did not improve the outputs puzzles me. Do you think I have missed anything here?

Thanks again, any help with these will be really appreciated. Please let me know if I can provide any more information to better understand the issue.

Cheers
Hasinika


Hi all,

Thank you so much for helping me with this. I have the following update since my last post.

Looking at the adapter-content graphs from FastQC (provided by the sequencing service), there are differences between this problematic run and the other, very similar run that I mentioned in the first post. Both runs were Illumina MiSeq (2x250 bp).
The problematic run, which filters out many reads at the filtering steps of DADA2/Deblur/vsearch:

The other run (same library prep and sequencing conditions) has a different graph for adapter content, which according to our sequencing service provider is usual:

Both libraries were prepared using the following Illumina adapter, index, primer pad, primer linker, and primer sequences.
Adapters are in bold; unique indices (barcodes) are shown as runs of X.

515F **AATGATACGGCGACCACCGAGATCTACAC** XXXXXXXX TATGGTAATT GT GTGCCAGCMGCCGCGGTAA

806R **CAAGCAGAAGACGGCATACGAGAT** XXXXXXXXXXXX AGTCAGTCAG CC GGACTACHVGGGTWTCTAAT

I am not sure whether this has anything to do with the high percentage of sequence loss I am experiencing.
Any help in figuring out this problem will be really appreciated.

Thank you so much!

Hasinika

Hello!
Thank you for providing such detailed information and for trying all the options before posting.
Several things to try...
Since your reads do not become significantly longer after merging with VSEARCH, you can reimport just your forward reads as single-end reads, eliminating the merging step and the associated loss of reads. A minimal import sketch is below.
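One way to do this (a sketch; it assumes you build a single-end manifest file, here called manifest.tsv, listing only the forward fastq files):

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path demux-single.qza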

After that, you can try to cut the primers with cutadapt:

qiime cutadapt trim-single \
    --i-demultiplexed-sequences demux-single.qza \
    --o-trimmed-sequences trim-single.qza \
    --p-front GTGYCAGCMGCCGCGGTAA \
    --p-match-adapter-wildcards \
    --p-discard-untrimmed \
    --p-match-read-wildcards

Now, since --p-discard-untrimmed is enabled, you can see whether the primers were cut from the reads (if the output is much smaller than the input, either the wrong primer was provided or the primers were already removed).

Could you try running it with --p-trunc-len 240 (if the primers were still in the sequences and you have now removed them, check the quality plots and the sequence lengths and set it even lower) or 0 (disabling truncation completely)? Also, you can add --p-max-ee and set it to 5 or 10 to relax the filtering parameters.
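For example, a relaxed single-end run could look like this (a sketch: the truncation length of 220 and the output names are assumptions, so check your own quality plots and read lengths before choosing values):

qiime dada2 denoise-single \
  --i-demultiplexed-seqs trim-single.qza \
  --p-trim-left 0 \
  --p-trunc-len 220 \
  --p-max-ee 5 \
  --o-representative-sequences rep-seqs-single.qza \
  --o-table table-single.qza \
  --o-denoising-stats stats-single.qza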

If DADA2 still does not work for you, you can repeat Deblur (now with only the forward reads). This time, try a lower --p-trim-length based on the length of the reads after cutadapt; a sketch is below.
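Something along these lines (a sketch: the trim length of 200 and the file names are assumptions to adjust against the post-cutadapt read lengths, and it runs the recommended quality-filter step first):

qiime quality-filter q-score \
  --i-demux trim-single.qza \
  --o-filtered-sequences trim-single-filtered.qza \
  --o-filter-stats trim-single-filter-stats.qza

qiime deblur denoise-16S \
  --i-demultiplexed-seqs trim-single-filtered.qza \
  --p-trim-length 200 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --o-stats deblur-stats.qza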

I am not sure about the underlying cause either, so if other moderators/members have any input about it, or about the issue overall, please join us in the comments :hugs:


Thank you so much @timanix for taking the time to help me with this; I appreciate your input so much :pray:t5:

I will retry with just the forward reads. I gave up on that approach when I saw no improvement, but you have a few great suggestions that I have not tested yet. Hope to get back soon :crossed_fingers:t5:

Hi @timanix and everyone, I found out that there are Ns in both the forward and reverse reads of this problematic Illumina MiSeq run, and that seems to be why I have been losing a lot of reads at the filtering steps. Relaxing the "qiime quality-filter q-score" parameters (setting --p-max-ambiguous to 1) retained a large proportion of the reads, and Deblur seems to process most of them without an issue. I understand that this is not ideal, but would anyone know whether there is a similar parameter in DADA2 on QIIME 2 (perhaps in the "qiime dada2 denoise-paired" step) to tweak this, like the "maxN" option in the R-based DADA2?
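For reference, the relaxed filtering call was along these lines (reconstructed from the description above, reusing the file names from my earlier commands in this thread):

qiime quality-filter q-score \
  --i-demux demux-joined.qza \
  --p-max-ambiguous 1 \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza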

Thank you for helping.
Cheers
Hasinika


Hi @Hasinika
Sorry for the long silence; your case is more complicated than I thought at first. I got a hint regarding your topic, so I am going to post it here. @Mehrbod_Estaki, thank you, and I hope you don't mind if I cite your message in my response.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.