DADA2 and a big dataset: optimize and/or split?

I'm processing a ~150GB 16S amplicon dataset (from a single AVITI run, 400+ samples, ridiculously deep with >1 million reads / sample), using QIIME2 2024.2. Denoise-paired with default settings (see below) takes more than 72 hours (using 20-40 cores and up to 370 GB RAM). That is, approx. 10-20 min per sample. This is a problem, as our supercomputer jobs are limited to 3 days.

I'm not sure what the optimal amount of RAM per core might be (I ran some trials, but it is difficult to say because the speed seems to vary). Is there something else that could be optimized?

I guess it's also possible to split the dataset into several smaller batches, basically artificial runs? I don't know how yet, but I will probably find this in the docs. Would this affect the results somehow? Should we assign the samples of each sample group randomly to those "runs", or does it not make any difference?

My settings currently:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 243 \
  --p-trunc-len-r 204 \
  --p-n-threads 40 \
  --verbose \
  --o-table table-72h.qza \
  --o-representative-sequences rep-seqs-72h.qza \
  --o-denoising-stats stats-dada2-72h.qza

Hello!
Yes, you can split the samples from your single run into artificial batches. The DADA2 developers recommend about 1M reads for training the error model (this is also the default of --p-n-reads-learn in denoise-paired). So, theoretically, you could even run it on every sample separately if you have 1M reads per sample. Make sure to run each split with the same DADA2 settings to keep the outputs compatible.

Another option to consider is subsampling your demultiplexed samples (check the plugin documentation).
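
A minimal sketch of the subsampling option, assuming the `subsample-paired` action of the demux plugin is available in your QIIME 2 release. The fraction (0.1) and the output file name are placeholders, not recommendations, and the `run` helper is a hypothetical convenience that only prints the command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

FRACTION=0.1   # placeholder: keep roughly 10% of the reads

# Randomly subsample the demultiplexed paired-end reads.
run qiime demux subsample-paired \
  --i-sequences demux-paired-end.qza \
  --p-fraction "$FRACTION" \
  --o-subsampled-sequences demux-subsampled.qza
```

Denoising the subsampled artifact first is a cheap way to check truncation and filtering settings before committing to the full run.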

Also, be aware that DADA2 was specifically designed for Illumina-produced sequences, so if AVITI uses a different quality-scoring approach, that may affect the DADA2 output. Increasing the max EE values is recommended, but I am not at all sure how well DADA2 handles AVITI files.

If you decide not to use DADA2, you can merge the paired reads with the vsearch plugin and proceed with Deblur as an alternative denoising tool. Again, use the same settings for each split.

Regarding how to split your samples into artificial runs, I would split them by sample type (if you have several), as long as the types are not going to be compared to each other. For example, if you have treatments and sample types, and the effect of the treatment is the main focus, I would split by sample type. Or split them randomly.
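
The random split could be sketched roughly as follows. This assumes GNU coreutils (`shuf`, `seq`), that an ID-only metadata file is accepted by `qiime demux filter-samples`, and it uses made-up sample IDs (S1..S10) and a hypothetical `run` helper that only prints the QIIME 2 commands unless `DRY_RUN=0` is set. The denoising parameters are copied from the original post so that every batch uses the same settings.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Randomly assign sample IDs (one per line in $1) to $2 batches.
# Writes batch-1.tsv ... batch-N.tsv into directory $3, each starting
# with a "sample-id" header line.
split_into_batches() {
  local ids_file=$1 n_batches=$2 out_dir=$3 b
  mkdir -p "$out_dir"
  for b in $(seq 1 "$n_batches"); do
    printf 'sample-id\n' > "$out_dir/batch-$b.tsv"
  done
  # shuf randomises the order; awk deals the IDs out round-robin.
  shuf "$ids_file" | awk -v n="$n_batches" -v dir="$out_dir" \
    '{ b = (NR - 1) % n + 1; print >> (dir "/batch-" b ".tsv") }'
}

# Demo with hypothetical sample IDs S1..S10 in a temporary directory.
work_dir=$(mktemp -d)
seq 1 10 | sed 's/^/S/' > "$work_dir/sample-ids.txt"
split_into_batches "$work_dir/sample-ids.txt" 3 "$work_dir/batches"

# One filtered artifact and one denoising run per batch, same settings.
for meta in "$work_dir"/batches/batch-*.tsv; do
  name=$(basename "$meta" .tsv)
  run qiime demux filter-samples \
    --i-demux demux-paired-end.qza \
    --m-metadata-file "$meta" \
    --o-filtered-demux "demux-$name.qza"
  run qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "demux-$name.qza" \
    --p-trunc-len-f 243 --p-trunc-len-r 204 \
    --p-n-threads 40 \
    --o-table "table-$name.qza" \
    --o-representative-sequences "rep-seqs-$name.qza" \
    --o-denoising-stats "stats-$name.qza"
done
```

To split by sample type instead, the batch files would simply be built from the sample IDs of each type rather than from a shuffled list.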

Best,

Thanks a lot for quick & helpful reply!

Yes, the different characteristics of AVITI data are another issue we are currently wondering about. Yesterday I opened an issue on the dada2 GitHub to learn a bit more.

Do you have recommendations on how to evaluate / validate the results after increasing maxEE?

Or would you actually recommend using the other methods, and are they less sensitive to the sequencing method? I understand AVITI reads tend to have more variable quality in the middle of the read (although the overall quality scores seem to be much higher). This is perhaps the reason we apparently need to truncate AVITI reads much shorter than expected based on Phred scores to get a decent % of accepted reads.

I would just increase it little by little until a high enough % of reads reaches the output. For a proper evaluation, I would try to reach the DADA2 developers.
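
One way to try this stepwise increase is a small sweep over maxEE values on a subset of the data. The sketch below is only an illustration: `demux-subsampled.qza` is a hypothetical file name for such a subset, the values 3 to 5 are arbitrary placeholders (2 is the plugin default), and the `run` helper only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Denoise the same input once per maxEE value, keeping every other
# setting fixed, and tabulate the denoising stats for comparison.
sweep_max_ee() {
  local ee
  for ee in 2 3 4 5; do
    run qiime dada2 denoise-paired \
      --i-demultiplexed-seqs demux-subsampled.qza \
      --p-trunc-len-f 243 --p-trunc-len-r 204 \
      --p-max-ee-f "$ee" --p-max-ee-r "$ee" \
      --p-n-threads 40 \
      --o-table "table-ee$ee.qza" \
      --o-representative-sequences "rep-seqs-ee$ee.qza" \
      --o-denoising-stats "stats-ee$ee.qza"
    run qiime metadata tabulate \
      --m-input-file "stats-ee$ee.qza" \
      --o-visualization "stats-ee$ee.qzv"
  done
}

sweep_max_ee
```

The tabulated stats show, per sample, how many reads survive filtering, denoising, merging and chimera removal, which is the percentage to watch while raising maxEE.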

I don't have experience with AVITI data, but I would go for Deblur to denoise it since it will not assume Illumina scores.

Increasing the EE values is another approach to get more reads through the filters. Shorter truncation is OK unless it shrinks the overlapping region to the point that reads fail to merge.

Correction:
Deblur also assumes Illumina error profiles. I would try both and use the one that passes a higher percentage of reads through the filters.

Hello everyone,

I'm new to processing amplicon datasets and I'm encountering a similar issue with the processing time using QIIME2. I'm working with a ~100GB dataset and finding that denoise-paired with default settings is taking a very long time (over 24 hours) on our supercomputer.

I see that you're using 20-40 cores and up to 370GB RAM for your processing. Have you tried adjusting the --p-n-threads parameter to see if it affects the processing time? I'm also curious if anyone has experience splitting the dataset into smaller batches, as mentioned. Does this affect the results or is there a recommended way to do this without compromising the analysis?

Any insights or suggestions would be greatly appreciated! Thank you.

Hello Lisa! Happy to share ideas & experiences!
I'm currently running dada2 with 20 vs 40 threads. Please see a primitive comparison attached (based on the progress bar dots in the output report file, which I check every now and then; note that the 20-thread job was started earlier, which is why it has progressed further). Unfortunately, it seems that both slow to a similar crawl after some time. I'm not sure whether this is due to the properties of the samples being processed at the moment (the dataset contains samples of different complexity) or to some technical issue. It seems I can't view the CPU efficiency statistics while the job is still running, so I also don't know how dada2 is actually using the cores I'm giving it.

May I ask which sequencing instrument you used? It sounds bigger than MiSeq (unless you combined several runs). I'm asking because I'm also wondering whether dada2 is the right way to go with other technologies. Our current data is from AVITI, and as you see, it has been suggested that we perhaps try Deblur instead, or try increasing maxEE. But I'm hesitating, as we do get a similar % of reads accepted vs. MiSeq if we truncate the reads at a much higher Q score threshold than is usual with MiSeq.

I now tested the same job with only one core and.... well, it's not even past the filtering stage after 3.5 hours :smiley: So, parallel processing seems to work.

And 40 threads might actually be faster than 20 after all, need to see a few more data points... currently running at about 0.06 samples/min vs 0.03-0.04 samples/min with 20 threads.

EDIT: Currently approx 0.046 samples/min using 20 threads and 0.065/min using 40 threads. So not twice as fast but faster (note that I still don't know the optimal RAM per thread).
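
To put these rates in terms of the 3-day job limit, here is a small back-of-the-envelope sketch. It assumes exactly 400 samples (the dataset has 400+) and a constant processing rate, which the observed slowdown suggests is optimistic; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hours needed to process $1 samples at $2 samples/min.
hours_needed() {
  awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", n / rate / 60 }'
}

# Smallest number of equally sized batches so that each batch
# finishes within $3 hours (ceiling of total hours / limit).
batches_needed() {
  awk -v n="$1" -v rate="$2" -v limit="$3" 'BEGIN {
    h = n / rate / 60
    b = int(h / limit)
    if (b * limit < h) b++
    print b
  }'
}

echo "20 threads: $(hours_needed 400 0.046) h, $(batches_needed 400 0.046 72) batches"
echo "40 threads: $(hours_needed 400 0.065) h, $(batches_needed 400 0.065 72) batches"
```

Under these assumptions the full dataset would need roughly 145 h with 20 threads and 103 h with 40 threads, so neither setting fits a single 72 h job and at least two batches would be required.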

I would also be curious to know if there is a way to subsample or split the data to see what the denoising looks like. If anyone can recommend the proper protocol for this command, that would be awesome. One question, though: would I just need to add a metadata file as a parameter, listing the barcodes I want to subsample, so that the demux step only demultiplexes the selected barcodes from the emp-paired-end qza file? And then run the denoising on the "subsample demux" qza file?

In case you have not found it already, here's the DADA2 page on 'Big Data'
https://benjjneb.github.io/dada2/bigdata.html

> I'm also curious if anyone has experience splitting the dataset into smaller batches, as mentioned. Does this affect the results or is there a recommended way to do this without compromising the analysis?

One of the strengths of DADA2 is that the ASVs it makes should be independent of what other data is included. So yes, splitting the dataset into smaller batches (as small as one sample each!) should work great! This also reduces max RAM usage.
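
After denoising the batches separately, the per-batch outputs need to be combined again. A rough sketch with `qiime feature-table merge` and `merge-seqs` follows; the batch names and file naming scheme are hypothetical, and the `run` helper only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Merge the feature tables and representative sequences of all
# batches named on the command line (table-<name>.qza, rep-seqs-<name>.qza).
merge_batches() {
  local table_args=() seq_args=() name
  for name in "$@"; do
    table_args+=(--i-tables "table-$name.qza")
    seq_args+=(--i-data "rep-seqs-$name.qza")
  done
  run qiime feature-table merge "${table_args[@]}" \
    --o-merged-table table-merged.qza
  run qiime feature-table merge-seqs "${seq_args[@]}" \
    --o-merged-data rep-seqs-merged.qza
}

merge_batches batch-1 batch-2 batch-3
```

Merging only lines up correctly if every batch was denoised with the same trimming and truncation settings, since the ASVs are then comparable across batches.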

I really appreciate your benchmarking. Great graphs! :bar_chart:
