I'm processing a relatively large 16S amplicon sequencing dataset (approx. 150 GB zipped; ~400 samples with ridiculously high numbers of reads, as this was done on a new AVITI instrument). The chimera removal step seems to take a VERY long time: this phase has now been running for almost 10 days (on a supercomputer node with 40 cores and 160 GB RAM). The denoising phase took "only" 5 days. This is a problem, as jobs on the supercomputer can run for a maximum of 2 weeks.
Based on pilots with a few samples, we unfortunately have quite a high number of chimeras, which might affect the speed (?). And yes, we have removed adapter/primer/etc. sequences.
Are there any parameters we could adjust to make it faster, or is the only way to split the dataset?
This is the command we use (with QIIME 2024):
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 243 \
  --p-trunc-len-r 204 \
  --p-n-threads 40 \
  --verbose \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats-dada2.qza
This is the output so far:
R version 4.2.2 (2022-10-31)
DADA2: 1.26.0 / Rcpp: 1.0.12 / RcppParallel: 5.1.6
2) Filtering .....................................................................................................................................................................................................................................................................................................................................................................................................................................................
3) Learning Error Rates
639443808 total bases in 2631456 reads from 2 samples will be used for learning the error rates.
536817024 total bases in 2631456 reads from 2 samples will be used for learning the error rates.
3) Denoise samples .....................................................................................................................................................................................................................................................................................................................................................................................................................................................
.....................................................................................................................................................................................................................................................................................................................................................................................................................................................
5) Remove chimeras (method = consensus)
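If splitting the data turns out to be the only way, this is roughly what we have in mind. It is only a sketch, not something we have run. The batch names and the per-batch file names (demux-batch1.qza, table-batch1.qza, and so on) are hypothetical placeholders, and we assume each batch would ideally correspond to one sequencing run, since DADA2 learns its error model from the reads it is given. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: denoise each batch separately, then merge the per-batch results.
# Batch names and file names below are hypothetical placeholders.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = also run them
BATCHES=(batch1 batch2)   # hypothetical: one demultiplexed artifact per batch

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

table_args=()
seq_args=()
for b in "${BATCHES[@]}"; do
  # Same truncation settings as in our original command.
  run qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "demux-${b}.qza" \
    --p-trunc-len-f 243 \
    --p-trunc-len-r 204 \
    --p-n-threads 40 \
    --o-table "table-${b}.qza" \
    --o-representative-sequences "rep-seqs-${b}.qza" \
    --o-denoising-stats "stats-${b}.qza"
  # Collect the inputs for the two merge steps.
  table_args+=(--i-tables "table-${b}.qza")
  seq_args+=(--i-data "rep-seqs-${b}.qza")
done

# Combine the per-batch feature tables and representative sequences.
run qiime feature-table merge "${table_args[@]}" --o-merged-table table.qza
run qiime feature-table merge-seqs "${seq_args[@]}" --o-merged-data rep-seqs.qza
```

Each batch would then be a separate, shorter job that fits within the 2-week limit, with the two merge steps run once all batches have finished.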