How long does the learning error rates step of DADA2 take?

Hi guys,

I am running DADA2 and I found this forum topic (Does the learning error rates step of DADA2 have to include all samples under study?).

As you can see in the attached picture, I am currently at the second step (assuming I will have the same output as in the previous topic). I have let it run for 2 days and it has yet to proceed to the next step. May I know how long it usually takes to finish step 2?

Hello!
The running time depends on multiple factors, such as the technical specifications of the machine you are using, the number of threads you provided in the command, the number of sequences you are denoising, and some other factors I forgot to mention or am not aware of.
In my experience it may take from several hours to several days.
But if you are running a lot of samples from different sequencing runs, be aware that it is recommended to denoise each sequencing run separately with the same settings, and then merge the feature tables and representative sequences.
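To illustrate what that looks like on the command line, here is a rough sketch using the QIIME 2 CLI. The file names (run1-demux.qza, run2-demux.qza), truncation lengths, and thread count are placeholders, so adjust them to your own data and check qiime dada2 denoise-paired --help for the full option list:

    # Denoise each sequencing run separately, using identical settings.
    for run in run1 run2; do
      qiime dada2 denoise-paired \
        --i-demultiplexed-seqs ${run}-demux.qza \
        --p-trunc-len-f 270 \
        --p-trunc-len-r 220 \
        --p-n-threads 8 \
        --o-table ${run}-table.qza \
        --o-representative-sequences ${run}-rep-seqs.qza \
        --o-denoising-stats ${run}-stats.qza
    done

    # Merge the per-run feature tables and representative sequences.
    qiime feature-table merge \
      --i-tables run1-table.qza --i-tables run2-table.qza \
      --o-merged-table merged-table.qza

    qiime feature-table merge-seqs \
      --i-data run1-rep-seqs.qza --i-data run2-rep-seqs.qza \
      --o-merged-data merged-rep-seqs.qza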

Best,


What's up with [1]+ Stopped?

I'm wondering if this job was paused or stopped or something

(You can check whether it is still using CPU by running the Linux program top. If it shows 0% CPU, it is no longer actively running.)
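For what it's worth, a "[1]+ Stopped" message usually means the shell suspended the job (e.g., with Ctrl+Z), so it will sit at 0% CPU until it is resumed. A quick way to check on Linux, assuming the process shows up under a name like qiime or R (adjust to whatever you actually see):

    # Find the PID of the denoising process (the name is a guess; it may
    # appear as "qiime", "R", or similar on your system).
    pgrep -fl qiime

    # Watch that PID; roughly 0% CPU means it is not actively running.
    top -p <PID>

    # If the shell reports the job as "[1]+ Stopped", resume it in the
    # foreground (or use bg to resume it in the background).
    fg %1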

I killed the previous command because I forgot to specify the path to save my output. Currently I can see the process is actively running, using around 24% of my memory.
I wonder whether the low memory consumption is what is making it run for a longer time; however, I did try running sudo ulimit -m unlimited before the subsequent command, as shown in this post. Yet only around 24% of my memory is allocated to the command.


2.5 hours for 36 samples (V3-V4; 341F = CCTACGGGNGGCWGCAG, 805R = GACTACHVGGGTATCTAATCC)

Chip: Apple M3 Pro
Total Number of Cores: 11 (5 performance and 6 efficiency)
Memory: 18 GB

Hi, may I know how big your qza file is?

The qza file is 1.8 GB.

On a different project, a 4.3 GB file took 4.5 hours.

I see; my qza file is around 5 GB, yet it is still running after 2 weeks. May I know how much CPU and memory it was consuming while your command was running?

That is a lot of reads! I am not surprised that it has been running for a long time.
I am now curious why you have such a big qza file. I can think of 2 possible scenarios:

  1. You have a lot of samples (thousands). Then it is better, as I wrote above, to split the samples by sequencing run, denoise each run separately with the same settings, and merge the outputs.
  2. The sequencing depth is really high. I once had a dataset with 1M reads per sample, so I just subsampled each sample to a fraction of 0.1 (see the sketch below).
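For the subsampling scenario, something along these lines should work through the QIIME 2 CLI; this is only a sketch with a placeholder file name, and it is worth confirming the exact action and option names with qiime demux --help:

    # Keep roughly 10% of the reads in each sample before denoising.
    qiime demux subsample-paired \
      --i-sequences demux.qza \
      --p-fraction 0.1 \
      --o-subsampled-sequences demux-subsampled.qza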

Well, in my opinion 5 GB should run within 6 hours on my specs (11 cores, 18 GB RAM, M3).
Two weeks is too long and something is wrong. Please have a look at my qza stats for reference.

It was using around 15 GB of RAM and I couldn't check the CPU usage. But I'm sure it used the maximum, as I allocated 10 cores out of 11. It was hot and noisy, with the cooling fan running almost throughout the run.

In case you have not found it already, here is the official DADA2 documentation for working with 'Big Data'. The strategies discussed there could be helpful for you, even though the examples use DADA2 directly in R instead of through the QIIME 2 plugin.

https://benjjneb.github.io/dada2/bigdata.html
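One option exposed by the QIIME 2 plugin that maps onto the big-data advice (learning the error model from a subset of reads rather than from everything) is, if I remember correctly, --p-n-reads-learn. A rough sketch with placeholder values; check qiime dada2 denoise-paired --help for the exact name and default:

    # Cap the number of reads used for the error-learning step (the step
    # this thread is stuck on); all values below are placeholders.
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs demux.qza \
      --p-trunc-len-f 270 \
      --p-trunc-len-r 220 \
      --p-n-threads 10 \
      --p-n-reads-learn 250000 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats stats.qza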


I am only running 8 samples at the moment, so it doesn't make sense to me either that it is taking this long.

Thank you for the update and help. I am currently away from the lab; I will check it when I head back.

Could you also confirm or answer the following questions?

  1. Could you provide the number of reads per sample before DADA2 and the file size in megabytes (e.g., from the demux summary, as sketched below)?
  2. Are you sure that you are running amplicon data (16S rRNA) and not metagenomic (shotgun) sequences?
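If it helps, both numbers are quick to pull out; demux.qza below is a placeholder for whatever your imported artifact is called:

    # File size on disk.
    ls -lh demux.qza

    # Per-sample read counts (open the resulting .qzv at https://view.qiime2.org).
    qiime demux summarize \
      --i-data demux.qza \
      --o-visualization demux.qzv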

Best,
