Using the silva pre-trained classifier consumes more than 256 GB of memory

BenKaehler · September 15, 2017, 8:12pm

Thanks for the closure, @yoshiki.

I would guess that the reason that all the reads went to the first worker is that chunk_size is set to be large enough to effectively turn off chunking. To get some sort of reasonable parallelism it would probably be better to set chunk_size to be, maybe, 1,000.

I've added some parameter documentation to classify-sklearn. This is what it looks like now:

$ qiime feature-classifier classify-sklearn --help
Usage: qiime feature-classifier classify-sklearn [OPTIONS]

  Classify reads by taxon using a fitted classifier.

Options:
  --i-reads PATH                  Artifact: FeatureData[Sequence]  [required]
                                  The feature data to be classified.
  --i-classifier PATH             Artifact: TaxonomicClassifier  [required]
                                  The taxonomic classifier for classifying the
                                  reads.
  --p-chunk-size INTEGER          [default: 262144]
                                  Number of reads to process
                                  in each batch.
  --p-n-jobs INTEGER              [default: 1]
                                  The maximum number of
                                  concurrently worker processes. If -1 all
                                  CPUs are used. If 1 is given, no parallel
                                  computing code is used at all, which is
                                  useful for debugging. For n_jobs below -1,
                                  (n_cpus + 1 + n_jobs) are used. Thus for
                                  n_jobs = -2, all CPUs but one are used.
  --p-pre-dispatch TEXT           [default: 2*n_jobs]
                                  "all" or expression, as
                                  in "3*n_jobs". The number of batches (of
                                  tasks) to be pre-dispatched.
  --p-confidence FLOAT            [default: 0.7]
                                  Confidence threshold for
                                  limiting taxonomic depth. Provide -1 to
                                  disable confidence calculation, or 0 to
                                  calculate confidence but not apply it to
                                  limit the taxonomic depth of the
                                  assignments.
  --p-read-orientation [same|reverse-complement]
                                  [optional]
                                  Direction of reads with respect
                                  to reference sequences. same will cause
                                  reads to be classified unchanged; reverse-
                                  complement will cause reads to be reversed
                                  and complemented prior to classification.
                                  Default is to autodetect based on the
                                  confidence estimates for the first 100
                                  reads.
  --o-classification PATH         Artifact: FeatureData[Taxonomy]  [required
                                  if not passing --output-dir]
  --output-dir DIRECTORY          Output unspecified results to a directory
  --cmd-config PATH               Use config file for command options
  --verbose                       Display verbose output to stdout and/or
                                  stderr during execution of this action.
                                  [default: False]
  --quiet                         Silence output if execution is successful
                                  (silence is golden).  [default: False]
  --help                          Show this message and exit.