Chimera removal taking a VERY LONG time

I'm processing a relatively large 16S amplicon seq dataset (approx. 150 GB zipped; ~400 samples with ridiculously high numbers of reads, as this was done on a new AVITI instrument). The chimera removal step seems to take a VERY long time: this phase has now been running for almost 10 days (on a supercomputer node with 40 cores and 160 GB RAM). The denoising phase took "only" 5 days. This is a problem, as the maximum job length on the supercomputer is 2 weeks.

Based on pilots with a few samples, we unfortunately have quite a high number of chimeras, which might affect the speed (?). And yes, we have removed adapter/primer/etc. sequences.

Are there any parameters we could adjust to make it faster, or is the only way to split the dataset?

This is the command we use (with QIIME 2024):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 243 \
  --p-trunc-len-r 204 \
  --p-n-threads 40 \
  --verbose \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats-dada2.qza

This is the output so far:

R version 4.2.2 (2022-10-31)
DADA2: 1.26.0 / Rcpp: 1.0.12 / RcppParallel: 5.1.6
2) Filtering .....................................................................................................................................................................................................................................................................................................................................................................................................................................................
3) Learning Error Rates
639443808 total bases in 2631456 reads from 2 samples will be used for learning the error rates.
536817024 total bases in 2631456 reads from 2 samples will be used for learning the error rates.
3) Denoise samples .....................................................................................................................................................................................................................................................................................................................................................................................................................................................
.....................................................................................................................................................................................................................................................................................................................................................................................................................................................
5) Remove chimeras (method = consensus)

Hello @Mikael_Niku, splitting the dataset would interfere with the parts of the dada2 algorithm that rely on the entire dataset being present to get accurate results. Unfortunately, that is most of them, which makes splitting an absolute last resort.

I would suggest you max out the CPUs you are able to use, if you haven't yet, and increase the memory accordingly. You could also ask your sysadmins about giving you an exception for this extremely long-running job.
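As a rough illustration of matching the thread count to the node, here is a minimal sketch. It assumes `nproc` (GNU coreutils) is available on the compute node; `DRY_RUN` is a made-up switch for this example, not a QIIME 2 option, and by default the command is only printed rather than run. All other options are copied from your command.

```shell
#!/bin/sh
# Sketch only: take the thread count from the node instead of hard-coding it.
# DRY_RUN is a hypothetical switch for this example (1 = print the command only).
DRY_RUN="${DRY_RUN:-1}"

# nproc reports the cores available to this process; fall back to 1 if it is missing.
THREADS="$(nproc 2>/dev/null || echo 1)"

# Build the command as positional parameters so no quoting is lost.
set -- qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 243 \
  --p-trunc-len-r 204 \
  --p-n-threads "$THREADS" \
  --verbose \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats-dada2.qza

if [ "$DRY_RUN" = "1" ]; then
  echo "$@"      # show what would be run
else
  "$@"           # actually run QIIME 2
fi
```

Passing `--p-n-threads 0` is documented to use all available cores as well, but on a shared scheduler an explicit count taken from the allocation may be the safer choice.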

If this doesn't work (or if you have already done this) and you are comfortable with R, it may be possible to use dada2 in R directly and run each of the individual dada2 steps as a separate job. This would happen outside of QIIME 2 and would not use QIIME 2 artifacts, so it would not track QIIME 2 provenance either. It could still get you results that I believe could be imported into QIIME 2 in the end. You may lose the denoising stats viz by default, though it would theoretically be possible to generate that as well. This would likely be a very involved process.
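For the import at the end, a rough sketch might look like the following. It assumes the R side has already written the chimera-filtered feature table as a classic BIOM-style TSV (features as rows, with a `#OTU ID` header line) and the representative sequences as FASTA. The file names and the `run` / `DRY_RUN` helper are made up for illustration; with `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch only: import results produced by dada2 in R into QIIME 2 artifacts.
# File names and the run/DRY_RUN helper are hypothetical.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1) Convert the TSV feature table to BIOM (HDF5).
run biom convert -i seqtab-nochim.tsv -o table.biom \
  --table-type="OTU table" --to-hdf5

# 2) Import the BIOM table as a feature table artifact.
run qiime tools import \
  --input-path table.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV210Format \
  --output-path table.qza

# 3) Import the representative sequences (FASTA).
run qiime tools import \
  --input-path rep-seqs.fna \
  --type 'FeatureData[Sequence]' \
  --output-path rep-seqs.qza
```

The feature IDs in the table have to match the sequence IDs in the FASTA file, otherwise downstream steps will not line up.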
