Is it possible to run the denoise process in parallel?

linneakh · March 15, 2017, 2:52am

Running the dada2 denoise command is taking a really long time. Is it possible to run this process in parallel? In the announcement for qiime2-2017.2, it says a new feature is “Parallel processing support and other performance enhancements for DADA2”.

Thanks.

thermokarst · March 15, 2017, 3:02am

Hi @linneakh! You can pass the --p-n-threads parameter to either denoise-single or denoise-paired to control the number of threads used for processing. The parameter takes an integer giving the number of threads DADA2 should use. The default is 1, and if you specify 0, all available cores are used.
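
As a rough sketch of what that might look like for denoise-single: the file names (demux.qza, table.qza, rep-seqs.qza) and the truncation length of 120 below are placeholders I made up for illustration, not values from your data, so substitute your own. The script only builds and prints the command so you can check it before running anything.

```shell
#!/bin/sh
# Sketch only: file names and TRUNC_LEN are hypothetical placeholders.
N_THREADS=0     # 0 = use all available cores; any positive integer = that many threads
TRUNC_LEN=120   # placeholder value; pick one based on your own quality scores

# Assemble the denoise-single call using only options listed in the --help output.
DENOISE_CMD="qiime dada2 denoise-single --i-demultiplexed-seqs demux.qza --p-trunc-len $TRUNC_LEN --p-n-threads $N_THREADS --o-table table.qza --o-representative-sequences rep-seqs.qza"

# Print the command for review.
echo "$DENOISE_CMD"
```

Once the printed command looks right, you can paste it into your terminal to run it. On a shared machine it is probably kinder to set N_THREADS to a fixed number rather than 0, so the job doesn't take every core.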

For future reference, you can pass the --help flag to any method or visualization in a plugin to get detailed information about the relevant inputs, outputs, and parameters.

$ qiime dada2 denoise-single --help

Usage: qiime dada2 denoise-single [OPTIONS]

  This method denoises single-end sequences, dereplicates them, and filters
  chimeras.

Options:
  --i-demultiplexed-seqs PATH     Artifact:
                                  SampleData[PairedEndSequencesWithQuality |
                                  SequencesWithQuality]  [required]
                                  The
                                  single-end demultiplexed sequences to be
                                  denoised.
  --p-trunc-len INTEGER           [required]
                                  Position at which sequences
                                  should be truncated due to decrease in
                                  quality. This truncates the 3' end of the of
                                  the input sequences, which will be the bases
                                  that were sequenced in the last cycles.
                                  Reads that are shorter than this value will
                                  be discarded.
  --p-trim-left INTEGER           [default: 0]
                                  Position at which sequences
                                  should be trimmed due to low quality. This
                                  trims the 5' end of the of the input
                                  sequences, which will be the bases that were
                                  sequenced in the first cycles.
  --p-max-ee FLOAT                [default: 2.0]
                                  Reads with number of expected
                                  errors higher than this value will be
                                  discarded.
  --p-trunc-q INTEGER             [default: 2]
                                  Reads are truncated at the
                                  first instance of a quality score less than
                                  or equal to this value. If the resulting
                                  read is then shorter than `trunc_len`, it is
                                  discarded.
  --p-n-threads INTEGER           [default: 1]
                                  The number of threads to use
                                  for multithreaded processing. If 0 is
                                  provided, all available cores will be used.
  --p-n-reads-learn INTEGER       [default: 1000000]
                                  The number of reads to
                                  use when training the error model. Smaller
                                  numbers will result in a shorter run time
                                  but a less reliable error model.
  --p-hashed-feature-ids / --p-no-hashed-feature-ids
                                  [default: True]
                                  If true, the feature ids in
                                  the resulting table will be presented as
                                  hashes of the sequences defining each
                                  feature. The hash will always be the same
                                  for the same sequence so this allows feature
                                  tables to be merged across runs of this
                                  method. You should only merge tables if the
                                  exact same parameters are used for each run.
  --o-table PATH                  Artifact: FeatureTable[Frequency]  [required
                                  if not passing --output-dir]
                                  The resulting
                                  feature table.
  --o-representative-sequences PATH
                                  Artifact: FeatureData[Sequence]  [required
                                  if not passing --output-dir]
                                  The resulting
                                  feature sequences. Each feature in the
                                  feature table will be represented by exactly
                                  one sequence.
  --output-dir DIRECTORY          Output unspecified results to a directory
  --cmd-config PATH               Use config file for command options
  --verbose                       Display verbose output to stdout and/or
                                  stderr during execution of this action.
                                  [default: False]
  --help                          Show this message and exit.

The same help text is also available on the doc site.