Running DADA2 on pre-processed data

Hello dear QIIME 2 community :waving_hand:

I am analyzing data that was created by sequencing the 16S V3-V4 region on an Illumina NovaSeq 6000 creating 250bp paired-end reads (I am aware that sequencing this region with relatively short reads creates some challenges, but that's what I have).
The sequencing was performed at Novogene and they provide their standard bioinformatics pipeline included in the sequencing package. However, since I need to combine data from several sequencing runs and further process the data, I need to do some steps of the analysis myself.

Here are the processing steps of the data I received: barcodes and primers were removed with cutadapt, reads were merged using FLASH, low-quality reads were filtered using fastp, chimeras were removed using vsearch. After all of this, the data was imported into QIIME 2 and denoised using the dada2 plugin to obtain ASVs.

As I was reading about all of the steps listed above to decide at which step I should feed the data into my own analysis, I learned that it is generally advised to run dada2 on unprocessed, unmerged reads, see for example the followind forum posts:
Can DADA2 be carried on pre-joined reads - User Support - QIIME 2 Forum
Is my Dada2 output normal? - User Support - QIIME 2 Forum
Determining correct pre-processing pipeline for - User Support - QIIME 2 Forum

Additionally, I was confused about why chimera removal is done twice, once by vsearch and then again by dada2.

I asked the bioinformatics team at Novogene about the reasoning behind this. Here are their responses:

  1. They first use FLASH to merge paired-end reads to generate longer consensus sequences and then run dada2 denoise-single because it increases the overall merging rate. This prevents substantial read loss, which commonly occurs with DADA2’s built-in mergePairs function due to poor sequence quality or variable amplicon lengths.

  2. Chimera removal is performed using vsearch by comparing against the Silva database. Dada2 then performs reference-free chimera removal. This approach minimizes false-positive ASVs as much as possible.

So now my question is: what do the more experienced microbiome researchers in the community think about this? Should I trust the pre-processing steps and simply start with the output artifacts from dada2 I received and merge them according the tutorial here? Or is it better to go back to the raw data with just adapters and primers removed, import into QIIME 2 and re-run dada2 myself?

Thanks a lot in advance for any insights on this topic!

Hi @p_auritus,

RE 1:
As you've found, it is inappropriate to send merged data to qiime dada2 denoise-single .... My understanding is that the merged reads will have a very different error profile than what DADA2 expects. If you look at a quality plot of the merged reads, you'll see that within the region of overlap between the two reads, i.e. the center of the single merged read, will have high quality scores. That is, the quality scores up until that center point (from both directions) might vary or decline until it hits that high quality center region. That region of overlap ends up having higher quality scores as both reads are calling the same base at the same position, thus increasing the confidence of the base call at those positions. To be honest, I am not sure why they just do not use a complete vsearch / deblur approach, as outlined here in the old user docs. :man_shrugging: More on this later...

RE 2:
There are two types of chimera removal, de novo and "reference-based". The de novo approach became popular with userach / uchime, then later adopted by DADA2. This approach obviates the need for a curated reference database, which can be difficult to develop and curate. Often these curated reference databases used to be rare, and mainly SSU/LSU rRNA gene specific. Though, this is not as much of an issue these days... Anyway, it is not uncommon to use both, especially if you have access to a well trusted reference database, that covers most of the expected taxa in your sample . That is, it is possible that the de novo approach may not detect a chimera, when it in fact is a chimera, or vice versa.

Anyway to minimize the work required for making a reference database for every gene, the de novo approach was developed. However, there are some caveats with de novo chimera removal... in some cases they can remove real data, even highly abundant real data. So, you need to look closely at the chimera checking parameters for the various tools. Always try a few parameter settings, although default is often "good enough" many can run into problems, and end up hacking together inappropriate pipelines rather than simply using the tools built-in functinality.

There are two options I often suggest with DADA2: first, make sure you trim the reads appropriately. Having too much overlap can increase the chances of mismatch detection between the two reads, which can cause read merging failures. Obviously, reads being too short will not merge at all. Secondly, you may want to consider adjusting other DADA2 settings. For example, I often set --p-min-fold-parent-over-abundance to 8 or 16 as outlined here.

Given RE 1 above, if DADA2 does not work as well on your data you can modify the above linked deblur approach, by adding uchime-denovo available within qiime vsearch uchime-denovo --p-method uchime3 ... on your deblur /dada2 output. See here.

Speaking from my own personal experience, I'd strongly recommend processing the data yourself. :slight_smile:

3 Likes

Hi @SoilRotifer ,

Thank you so much for your detailed answer, this is really helpful!

I guess it makes sense to re-run the analysis myself to ensure DADA2 is run properly. Thanks also for the tips regarding the DADA2 settings!

However, I don't think I fully understand your recommendation here: my reads don't show any obvious drop in quality, so my initial idea was to not trim/truncate at all (I am talking about sequences with adapters & primers already removed). Do I still risk having too much overlap in such a case? And if yes, is there a way how I can figure this out other than just trying different combinations of trim/trunc parameters and then choosing the ones that maximize merging percentage?

Even if you are not observing drops in quality, does not mean the sequences are free of errors. It just means that there are less errors. Let's assume, in a magical world, that all of the base calls for all sequences are Q40. This means you have 1 error in 10,000 base calls across all of your data. That is, you can have a high quality score, or high confidence, in an incorrect base call. Thus, the longer your region of overlap for paired reads, inherently increases the chances of a mismatch. Once you go over the mismatch threshold the merge will fail. In this example it would be rare, and you'd loose some sequences, but likely not very significant.

I often like to aim for ~ 20-30 bases of overlap. By default DADA2 expects ~ 12 bases of overlap. In order to find a good ball-park range for you truncation parameters you can estimate how much overlap there is in your reads. I am not sure which primer set you are using for your V3V4 data, but I'll assume (as does this post), that you are using the 341F-805F primers. Thus:

  • 805-341 = ~464 bp amplicon length
  • Estimated overlap using 2x250 would be: 2x250 - 464 = 36bp.

If you search the forum for "calculate amplicon length overlap" or similar, you'll come across several forum threads.

Given the above, you should be good to go. You could then simply adjust your truncation parameters such that your overlap does down to ~20 bp. But that often is not necessary. However, I've had a few instances were reducing the overlap to this level, even with "good" data, helped substantially. But your mileage may vary.

Finally, do keep in mind that the above calculation can be affected after primer removal. For more details on that see this post.

3 Likes

It never occurred to me that too much overlap could be a bad thing, but your explanation makes a lot of sense, thanks!

As I am working with a relatively long amplicon sequenced with relatively short reads and therefore don't expect a very large overlap anyway (as you have nicely pointed out with your calculation above), I decided to first just try it without truncating.

Here's a screenshot showing the denoising stats for some of my samples. If I interpret this correctly, I lose only very few reads at the merging step, as "percentage of input passed filter" and "percentage of input merged" are very similar:

Out of my total of 77 samples, I see a clear drop in read counts at the merging step for only 2 samples where it drops to 82.21% and 78.19%, respectively. I think I am fine with that.

One thing I noticed though is that because I don't perform any filtering based on read-length here, I end up with a few outlier sequences that are much shorter than the rest in my rep-seqs. I think I will simply get rid of those outliers using feature-table filter-seqs.

Since I am not removing a lot of chimeras here, is my understanding correct that I probably don't need to increase --p-min-fold-parent-over-abundance as a higher number for this parameter would make chimera removal more conservative, i.e. even fewer reads would be removed?

Finally, I will consider adding a reference-based chimera removal step to minimize false-positive ASVs (in line with Novogene's pipeline).

2 Likes

Hi @p_auritus,

Looks like you got it! :100:

1 Like

Quick update for anyone who is also working with Novogene and might be curious about this:

I added a second, reference-based chimera removal step after dada2 using vsearch uchime-ref based on SILVA 138.2.

Since Novogene's argument for using FLASH to merge reads before denoising with dada2 was that substantial read loss can occur during dada2's built-in merging, I compared the number of reads I retained with my analysis to the number from their pipeline.
Out of my 77 samples, the read counts differed by less than 10% for 73 samples. There were 2 samples for which I had a drop in reads after merging (already mentioned those above) and for 2 other samples I removed a higher number of chimeras with uchime-ref. Therefore I would say the read counts obtained by the two methods are very similar overall.

However, the fact that I now performed denoising with dada2 how it is intended to be done by the method's authors plus being able to control every parameter along the way makes me trust my analysis much more.

Thanks again @SoilRotifer for your help!

3 Likes

Note the response here by Ben Callahan regarding why merging reads before error inference is not recommended (spelling corrected, sorry @benjjneb ):

Merging first would be faster, but the difference between what QXY means in the part of the merged read that was covered only by the forward read, only be the reverse read, or by both, remains the issue, as DADA2 does not know about those different contexts.

I can easily see how, practically speaking, coming up with a model that accounts for varying expected errors rates across merged/unmerged regions, and dealing with the known differences in error rates in R1 vs R2, would be nightmarish. Esp when dealing with vastly difference amplicon sequence data, including ones (like ITS) which aren't consistent, or V3-V4 which IIRC has a bimodal distribution.

5 Likes