dada2 large percentage of paired end 16S v4 reads not passing filter

Hello!

I am using DADA2 to denoise my 16S v4 (prepared using the EMP protocol with 515F-806R primers) paired end sequencing data, using qiime2-2021.8. For all samples, the percentage of reads that passed the filter seems very low to me, with the highest being 57.17% (denoising-stats.qzv (1.2 MB))

Here's the exact command I used. I didn't trim at all and hardly truncated because I thought our quality scores looked good enough, with median quality scores at every base position above 25 (demux.qzv (325.0 KB)):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux_miseq20210813_20210903.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 251 \
  --p-trunc-len-r 250 \
  --o-table table-miseq20210813.qza \
  --o-representative-sequences rep-seqs_miseq20210813.qza \
  --o-denoising-stats denoising-stats_miseq20210813.qza

I'm a little confused as to why so many of my reads are not passing the filter. We used 2x250bp paired end sequencing with a region that is ~253bp long, and I only truncated one base off of the reverse reads. So inadequate overlap should not be the issue, but is it possible the reads could be overlapping by too much (is that a thing)? Or maybe my read quality is just worse than I thought?

Sorry, I realize this is a common topic on the forum, but I have not come across this exact scenario in a previous post.

Thanks so much for your help.

Hi @Dot,

Thank you for sharing the demux.qzv! It seems there are periodic drops in quality throughout the reads. This can become a problem when the forward and reverse reads are highly overlapping.

To be clear, a large overlap is great when the reads are high quality, as you are more likely to generate very accurate ASVs. However, the occasional dips in quality in both reads become a problem in this case. These dips, combined with the large overlap between the read pair, increase the occurrence of mismatches between the reads. More details about this can be found here:

Since you have highly overlapping reads, you can be more liberal with your truncation parameters (to minimize detectable mismatches within the region of overlap). Read further down in the thread I linked above on that particular point.
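To make the overlap arithmetic concrete, here is a minimal sketch. It assumes the ~253 bp amplicon length quoted above, reads that start at the ends of the amplicon (no primers in the reads), and DADA2's default minimum overlap of 12 nt. The truncation pairs other than 251/250 are hypothetical values for illustration, not recommendations.

```shell
# Expected overlap after truncation: trunc_f + trunc_r - amplicon length.
amplicon_len=253   # approximate V4 amplicon length from this thread
min_overlap=12     # assumed: DADA2's default minimum overlap for merging

overlap() {  # usage: overlap TRUNC_F TRUNC_R
  echo $(( $1 + $2 - $amplicon_len ))
}

# "trunc-len-f trunc-len-r" pairs; only the first comes from the thread.
for pair in "251 250" "200 150" "140 120"; do
  set -- $pair
  o=$(overlap "$1" "$2")
  if [ "$o" -ge "$min_overlap" ]; then
    status="enough to merge"
  else
    status="too short to merge"
  fi
  echo "trunc-len-f=$1 trunc-len-r=$2 overlap=${o} bp ($status)"
done
```

With the original 251/250 settings almost the whole amplicon (248 bp) is covered by both reads, so any position where they disagree can count against merging. Shortening either read shrinks that region, as long as the overlap stays above the minimum.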

-Cheers!
-Mike

Hi Mike,

Thank you so much for your prompt and detailed response! This makes a lot of sense. I took a look at the thread you included as well as the threads within that thread (thanks for linking all of those!), and still have a few lingering questions:

  • It makes sense that truncating more sequence bases would give me more remaining sequences that pass filter after DADA2, since this would create less overlap where potential mismatches would be detected. However, don't I want to detect all these mismatches to make sure my data is the most accurate? (i.e. is it better to just use the more stringent overlap criteria, while sacrificing the amount of reads that pass filter?)
  • I was considering using just the forward reads because overall each sequence base has a better quality score. However, is that less preferable than using paired-end reads, since there is no consensus between two reads at the same sequence base?
  • From what I've understood by reading other posts in the forum, it seems like a median quality score of above 20-25 is "decent" enough to keep and not warrant a trim or truncate at that position. In my quality score plots, the periodic drops in quality seem to not have median quality scores lower than 20. That's sort of why I ignored the fact that quality scores dropped at different positions along the 251 bp sequence length. Should I more frequently be taking into account the entire boxplot/ other percentiles of the quality score distribution and not just the median quality score values?

Thank you!

Great! Thank you @Dot!

Very good question. The recommendation I provided works best when it is clear that there is declining quality towards the 3' end of the reads. Which, admittedly, is not necessarily the issue in your case. However, there is a way to gauge the amount of reads you should retain, which leads us to your next question. :arrow_down:

Many on the forum, myself included, have had to proceed with our analyses by using only the forward reads. In my case the reads either did not overlap, or the quality of the reverse read was so poor, that I was left with little to no usable sequence data. :wastebasket:

In cases where I can be partially successful with merging reads, and am unsure of what to do, I will compare the output of the merged reads to that of using only the forward reads. By processing only the forward reads I will have an 'upper bound' of sequences that I'll likely never reach with merging paired-end reads (due to the issues we discussed earlier). But, if I obtain ~10k reads per sample using the forward reads, and ~7-8k reads per sample with my merged paired-end reads, then I am okay with using the paired-end reads. However, if I only retain ~2-4k reads per sample after merging my paired-end reads, then I will likely play around with the truncation parameters to see if I can increase that number. Often, I simply adjust the truncation parameters as I mentioned earlier. This step often has the most significant impact on retaining more reads.
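As a rough illustration of that comparison, the sketch below expresses merged read counts as a percentage of the forward-only 'upper bound'. The counts are midpoints of the ranges mentioned above, used only as an example, and the helper name `percent_retained` is made up for this sketch.

```shell
# percent_retained MERGED FORWARD
# Merged paired-end reads as an integer percentage of forward-only reads.
percent_retained() {
  echo $(( 100 * $1 / $2 ))
}

forward_only=10000   # ~10k reads per sample from forward reads only

echo "7500 merged: $(percent_retained 7500 "$forward_only")% of forward-only"
echo "3000 merged: $(percent_retained 3000 "$forward_only")% of forward-only"
```

There is no fixed cutoff here. The point is only to put the two outputs on the same scale before deciding whether to revisit the truncation parameters.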

Also, the mismatches are only a concern over the region of overlap. :thinking: If we are fine with using only the forward reads for analysis, without any reverse read to help correct or confirm our base calls, then why not be okay with truncating the reads prior to merging? Sure, we'll keep some errors, but we'll gain a longer sequence, which often helps with taxonomic classification. Also, do not forget, the denoising algorithm will try to work out what is likely a real base call vs a sequencing error. Anyway, that is just my opinion, and I feel it has been working for me. Your mileage may vary. :fuelpump:

But either way, I would process the data with only the forward reads and the truncated paired reads and see which will provide the best data to help you with your research questions. I am sure many others will have variations on how they approach this problem. :slight_smile:
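A forward-only run for that comparison could look something like the sketch below. It assumes that `qiime dada2 denoise-single` accepts the same paired-end demux artifact and uses only the forward reads, which is my understanding of the plugin. The output file names and the truncation length of 250 are hypothetical. The small `run` wrapper only prints the command unless `DRY_RUN=0` is set, so nothing is executed by accident.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

DEMUX=demux_miseq20210813_20210903.qza   # input artifact from this thread
TRUNC_F=250                              # hypothetical forward truncation

# Denoise the forward reads only (output names are placeholders).
run qiime dada2 denoise-single \
  --i-demultiplexed-seqs "$DEMUX" \
  --p-trim-left 0 \
  --p-trunc-len "$TRUNC_F" \
  --o-table table-fwd.qza \
  --o-representative-sequences rep-seqs-fwd.qza \
  --o-denoising-stats denoising-stats-fwd.qza
```

The resulting denoising stats can then be compared side by side with those from `denoise-paired` run with more aggressive truncation.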

Totally understandable! Regardless, I often truncate at the last base or two of the read, as there can be unreliable estimates of quality, even if they appear to be high quality. So, in this case I'd truncate at 249 or 250. :scissors:

In fact, this is exactly what I do when approached with such data. If I have a choice, I take the rather subjective approach of, "if the lower part of the boxes often falls below Q20 or Q25, then truncate there", but as you've noticed we do not often have that luxury, so I tend to truncate as much of the ends of the reads as I can while still keeping enough overlap for the reads to merge.

From here I compare the taxonomy assignments between the forward and merged reads. If the two data sets are similar, often with a little better taxonomy assignments of the longer reads, then I stick with the merged reads... but obviously this depends on the questions being asked of the data, and whether or not sequence abundances or taxonomic resolution is more important.

I hope this helps. Again, others likely have other experience that they'd like to share. :speaking_head:

All of this was extremely helpful, thank you again for all your help!

As suggested, I'm moving forward with comparing the output of the merged paired end reads vs. forward reads only, and then making a judgement call as to which one to go with. However, I have data from five separate miseq runs that I need to merge after denoising. I know the trimming parameters for paired end reads need to be the same for each run in order to merge the data from separate runs (as I've read about here). I just wanted to confirm that in this case it is best practice to pick the same option (i.e. go with all forward reads only or all merged paired end reads only for all the runs I'm merging) for consistency?

Thanks!

Yes, that is correct.

The gene region covered must be identical across all of the sequencing runs. Otherwise, you'll have many ASVs exclusive to a set of samples. That is, ASVs from the longer merged reads will not appear in samples in which only the shorter forward reads were used, which will unduly affect your diversity measures. Remember that with ASVs we differentiate at the single-nucleotide level, and this includes sequence length variation as well as nucleotide substitutions.

As noted in the thread you linked, each run can be slightly different in terms of its quality, so you can use slightly different truncation parameters (not trim parameters) for each run to get the reads to merge.
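A minimal sketch of that per-run workflow is below: the trim parameters are fixed across runs, the truncation lengths vary per run, and the resulting tables and representative sequences are combined afterwards. The run names, file names, and truncation values are all hypothetical. The `run` wrapper only prints each command unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

# Trim parameters must be identical for every run.
TRIM_F=0
TRIM_R=0

# Hypothetical runs as name:trunc_f:trunc_r (truncation may differ per run).
RUNS="run1:240:200 run2:230:210"

merge_tables=""
merge_seqs=""
for spec in $RUNS; do
  name=${spec%%:*}
  rest=${spec#*:}
  trunc_f=${rest%%:*}
  trunc_r=${rest#*:}

  run qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "demux-$name.qza" \
    --p-trim-left-f "$TRIM_F" --p-trim-left-r "$TRIM_R" \
    --p-trunc-len-f "$trunc_f" --p-trunc-len-r "$trunc_r" \
    --o-table "table-$name.qza" \
    --o-representative-sequences "rep-seqs-$name.qza" \
    --o-denoising-stats "denoising-stats-$name.qza"

  merge_tables="$merge_tables --i-tables table-$name.qza"
  merge_seqs="$merge_seqs --i-data rep-seqs-$name.qza"
done

# Word splitting of the accumulated flags is intentional here.
run qiime feature-table merge $merge_tables \
  --p-overlap-method sum \
  --o-merged-table table-merged.qza
run qiime feature-table merge-seqs $merge_seqs \
  --o-merged-data rep-seqs-merged.qza
```

`--p-overlap-method sum` only matters when the same sample IDs appear in more than one run. If every sample ID is unique to one run, the default overlap handling should be sufficient.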

Got it! Thank you for your help!

Sorry to follow up so late about this, but I have two more questions regarding this last statement:

As noted in the thread you linked, each run can be slightly different in terms of its quality, so you can use slightly different truncation parameters (not trim parameters) for each run to get the reads to merge.

Just to clarify, does this mean I can use different quality score "thresholds" to inform truncation decisions (e.g. trunc the sequence base where q<30 for one run, but use q<25 for another run?). I assumed yes based on posts like this one, but I just wanted to make sure I'm not misinterpreting what you had said.

Also, is it still acceptable to use different quality score thresholds to inform truncation decisions if two runs contain the same set of samples and I will be combining the actual sequence data for each sample ID (e.g. using: qiime feature-table merge with the --p-overlap-method sum parameter)?

Thank you!

Personally, I like to use the same quality settings for all my runs... otherwise you are being more strict with one run than another, and may artificially increase differences between your runs. That is, you are keeping sequences in one run that are being discarded in another run. Though I am sure others will have varying opinions on this. The important thing is that your sequences are of the same length and cover the exact same region.

If you are concerned about using q<25 on all your runs, keep in mind that there are other quality control / removal steps you can perform within :qiime2:, after denoising. Denoising does not remove all potential problems :worried: . I often run qiime quality-control exclude-seqs, and filter based on taxonomy, like so:
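The example originally shared at this point is not preserved here. As a hedged sketch only, and not necessarily the settings used above, post-denoising filtering with these two tools could look like the following. All file names are placeholders, the reference sequences are whatever database you choose, and the identity and alignment thresholds shown are example values to tune for your data. The `run` wrapper only prints each command unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

# Split representative sequences into those that match a reference
# (hits) and those that do not (misses). Thresholds are placeholders.
run qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.97 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

# Drop features whose taxonomy matches unwanted labels.
run qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-filtered.qza
```

Whether you keep the hits or the misses depends on the reference: keep hits when the reference is the target gene, keep misses when the reference is a known contaminant.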

-Mike

Ah, okay I see. That makes sense, thanks! I will try to use the same quality settings, then follow up with qiime quality-control exclude-seqs and filter based on taxonomy as you suggested. Thank you!
