core-metrics-phylogenetic: why so much required output?

nick-youngblut · November 2, 2020, 11:21am

The core-metrics-phylogenetic cli help docs in 2020.8.0 state:

Outputs:
  --o-rarefied-table ARTIFACT FeatureTable[Frequency]
                          The resulting rarefied feature table.     [required]
  --o-faith-pd-vector ARTIFACT SampleData[AlphaDiversity]
                          Vector of Faith PD values by sample.      [required]
  --o-observed-features-vector ARTIFACT SampleData[AlphaDiversity]
                          Vector of Observed Features values by sample.
                                                                    [required]
  --o-shannon-vector ARTIFACT SampleData[AlphaDiversity]
                          Vector of Shannon diversity values by sample.
                                                                    [required]
  --o-evenness-vector ARTIFACT SampleData[AlphaDiversity]
                          Vector of Pielou's evenness values by sample.
                                                                    [required]
  --o-unweighted-unifrac-distance-matrix ARTIFACT
    DistanceMatrix        Matrix of unweighted UniFrac distances between
                          pairs of samples.                         [required]
  --o-weighted-unifrac-distance-matrix ARTIFACT
    DistanceMatrix        Matrix of weighted UniFrac distances between pairs
                          of samples.                               [required]
  --o-jaccard-distance-matrix ARTIFACT
    DistanceMatrix        Matrix of Jaccard distances between pairs of
                          samples.                                  [required]
  --o-bray-curtis-distance-matrix ARTIFACT
    DistanceMatrix        Matrix of Bray-Curtis distances between pairs of
                          samples.                                  [required]
  --o-unweighted-unifrac-pcoa-results ARTIFACT
    PCoAResults           PCoA matrix computed from unweighted UniFrac
                          distances between samples.                [required]
  --o-weighted-unifrac-pcoa-results ARTIFACT
    PCoAResults           PCoA matrix computed from weighted UniFrac
                          distances between samples.                [required]
  --o-jaccard-pcoa-results ARTIFACT
    PCoAResults           PCoA matrix computed from Jaccard distances between
                          samples.                                  [required]
  --o-bray-curtis-pcoa-results ARTIFACT
    PCoAResults           PCoA matrix computed from Bray-Curtis distances
                          between samples.                          [required]
  --o-unweighted-unifrac-emperor VISUALIZATION
                          Emperor plot of the PCoA matrix computed from
                          unweighted UniFrac.                       [required]
  --o-weighted-unifrac-emperor VISUALIZATION
                          Emperor plot of the PCoA matrix computed from
                          weighted UniFrac.                         [required]
  --o-jaccard-emperor VISUALIZATION
                          Emperor plot of the PCoA matrix computed from
                          Jaccard.                                  [required]
  --o-bray-curtis-emperor VISUALIZATION
                          Emperor plot of the PCoA matrix computed from
                          Bray-Curtis.                              [required]

Why do all of these have to be required? Maybe the user doesn't want all of that output. How about making all of these optional, so that if the user doesn't provide a path, that analysis isn't run and no file is written?

timanix · November 2, 2020, 11:51am

Hi!
You can provide a directory to hold all of the output files, so there is no need to specify each file separately.
You are looking at a "pipeline", the aim of which is to calculate a bunch of diversity metrics in one run, with the same comparable settings and from the same rarefied table.
You can always use the other diversity actions from the documentation to produce selected metrics separately.

nick-youngblut · November 2, 2020, 12:08pm

So the [required] labels are not correct and should be removed, right? The outputs are not actually required.

nick-youngblut · November 2, 2020, 12:14pm

Also, the output directory option still requires that all analyses be completed, even if the user doesn't want all of the output (e.g., all except one of the diversity metrics). If the code is written in a modular manner, then it should be easy to include "if path not provided, don't do analysis" logic. This would allow the user to run just one command (core-metrics-phylogenetic) instead of typing in all of the pipeline's commands separately just to get a bit of the output of core-metrics-phylogenetic.
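
The "if path not provided, don't do analysis" idea can be illustrated with a toy Python sketch. This is not QIIME 2 code: `run_selected`, `ANALYSES`, and the two stand-in metric functions are hypothetical names made up for illustration, and the sketch returns values keyed by path instead of writing files.

```python
import math
from typing import Callable, Dict, List, Optional


def observed_features(counts: List[int]) -> float:
    """Stand-in metric: number of features with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts: List[int]) -> float:
    """Stand-in metric: Shannon entropy (log base 2) of the counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical registry mapping an output name to the analysis producing it.
ANALYSES: Dict[str, Callable[[List[int]], float]] = {
    "observed_features": observed_features,
    "shannon": shannon,
}


def run_selected(counts: List[int],
                 output_paths: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Run only the analyses for which an output path was provided."""
    results: Dict[str, float] = {}
    for name, analysis in ANALYSES.items():
        path = output_paths.get(name)
        if path is None:
            continue  # no path given, so this analysis is skipped entirely
        results[path] = analysis(counts)
    return results


# Only the Shannon analysis runs here; observed_features is never computed.
print(run_selected([10, 0, 5, 5], {"shannon": "shannon.txt"}))
```

The set of results returned by such a function depends on which paths the caller supplies, so the outputs are no longer fixed in advance.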

thermokarst · November 2, 2020, 2:34pm

Hey @nick-youngblut, great questions!

Because this is a convenience pipeline - it bundles up a bunch of the more common diversity analyses into one step. Since it's built as a QIIME 2 Pipeline Action, it is actually composed of multiple discrete steps - all of those discrete steps are individual QIIME 2 Method or Visualizer Actions, and can be run on their own, allowing you to create your own set of steps (more on that below).

This is an interesting idea, but it goes against the design ethos of QIIME 2 - each QIIME 2 Action has a known set of input parameters, and produces a known set of outputs - this is deterministic, and, we think, easier for people to reason about.

The --output-dir option doesn't change the fact that the outputs are required - you're just opting into allowing q2cli to come up with the filenames to save to, for you.
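
As a rough sketch of what that looks like in practice, the Python snippet below only assembles the command line and prints it; it does not run QIIME 2. The helper name `core_metrics_command`, the file names, and the sampling depth are placeholders assumed for illustration. The flag names are the ones I understand `core-metrics-phylogenetic` to accept, so check them against `--help` on your own install.

```python
import shlex
from typing import List


def core_metrics_command(table: str, phylogeny: str, metadata: str,
                         sampling_depth: int, output_dir: str) -> List[str]:
    """Build (but do not run) a core-metrics-phylogenetic call.

    No individual --o-* flags are passed: with --output-dir, q2cli
    chooses the file names, but every output is still produced.
    """
    return [
        "qiime", "diversity", "core-metrics-phylogenetic",
        "--i-table", table,
        "--i-phylogeny", phylogeny,
        "--p-sampling-depth", str(sampling_depth),
        "--m-metadata-file", metadata,
        "--output-dir", output_dir,
    ]


# Placeholder inputs, not taken from the thread.
cmd = core_metrics_command("table.qza", "rooted-tree.qza",
                           "sample-metadata.tsv", 1000,
                           "core-metrics-results")
print(" ".join(shlex.quote(part) for part in cmd))
```

In an environment where QIIME 2 is installed, the list could be handed to `subprocess.run(cmd, check=True)`.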

Luckily it is, please see the following composable Actions that make up this Pipeline:

Hope that helps clear things up! :qiime2:

thermokarst · November 2, 2020, 2:53pm

PS, @nick-youngblut - if you would like to explore building a QIIME 2 Pipeline that does what you want, this would be the place to discuss that (here in "Developer Discussion")! You could make a q2-nick-youngblut plugin that composes those actions that most interest you, and it might help give you some more insight into the design of QIIME 2 - we would love to lend a hand if that is something you want to do. :qiime2:

nick-youngblut · November 2, 2020, 3:22pm

Thanks for the feedback! I see the logic behind the pipeline, but it is a bit confusing when the user sees a very long list of parameters with the [required] label. I was just trying to suggest a way to make core-metrics-phylogenetic more flexible (the user could choose which output to get) while making the cli docs a bit easier to understand (no [required] then needed for all output params).

I did what I'm guessing most users who don't want all of the output do: I created a long list of qiime diversity * commands to generate the individual outputs. It's many more commands, but it gets the job done.
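
For readers wondering what such a list might look like, here is a minimal sketch for a small subset of the outputs (rarefied table, Faith PD, and Bray-Curtis with its PCoA). These are not the commands actually used in this thread: the helper `subset_commands` and all file names are assumptions, and the action and flag names reflect my understanding of the q2-feature-table and q2-diversity help text, so verify them with `--help` before relying on them. The snippet only builds the commands; it does not execute them.

```python
from typing import List


def subset_commands(table: str, phylogeny: str,
                    sampling_depth: int) -> List[List[str]]:
    """Build commands for a few individual steps of the pipeline.

    Every step after the first reads the same rarefied table, which
    mirrors how the pipeline keeps its metrics comparable.
    """
    rarefied = "rarefied-table.qza"
    bray_curtis = "bray-curtis-distance-matrix.qza"
    return [
        ["qiime", "feature-table", "rarefy",
         "--i-table", table,
         "--p-sampling-depth", str(sampling_depth),
         "--o-rarefied-table", rarefied],
        ["qiime", "diversity", "alpha-phylogenetic",
         "--i-table", rarefied,
         "--i-phylogeny", phylogeny,
         "--p-metric", "faith_pd",
         "--o-alpha-diversity", "faith-pd-vector.qza"],
        ["qiime", "diversity", "beta",
         "--i-table", rarefied,
         "--p-metric", "braycurtis",
         "--o-distance-matrix", bray_curtis],
        ["qiime", "diversity", "pcoa",
         "--i-distance-matrix", bray_curtis,
         "--o-pcoa", "bray-curtis-pcoa-results.qza"],
    ]


# Placeholder inputs, not taken from the thread.
for command in subset_commands("table.qza", "rooted-tree.qza", 1000):
    print(" ".join(command))
```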

nick-youngblut · November 3, 2020, 7:28am

Thanks @thermokarst for the suggestion of creating a plugin. You've inspired me to consider creating one. 2 ideas that I have:

- q2-GTDB
  - mapping taxonomies to the GTDB (already have some code at GitHub - nick-youngblut/gtdb_to_taxdump: Convert GTDB taxonomy to NCBI taxdump format)
  - based on the taxonomy mappings, listing stats from GTDB about those taxonomic groups (e.g., how many genomes are available and what quality)
- q2-cli-view
  - an easy way to directly write text portions of qza files (e.g., feature counts) to stdout for rapid inspection via unix commands (without having to use qiime tools export and possibly biom for converting to text)
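
A minimal sketch of the second idea, using only the Python standard library. It assumes that a .qza file is a zip archive whose contents sit under a single top-level directory (the artifact UUID), with `metadata.yaml` and a `data/` directory inside it. The function name `read_qza_text` is hypothetical, and only members that are already plain text can be read this way; feature tables are, as far as I know, stored as binary BIOM and would still need converting.

```python
import zipfile


def read_qza_text(qza_path: str, member: str) -> str:
    """Return the text of one archive member, e.g. 'data/taxonomy.tsv'.

    `member` is given relative to the archive's top-level directory,
    so the caller does not need to know the artifact UUID.
    """
    with zipfile.ZipFile(qza_path) as archive:
        for name in archive.namelist():
            # Drop the leading '<uuid>/' component before comparing.
            _, _, relative = name.partition("/")
            if relative == member:
                return archive.read(name).decode("utf-8")
    raise KeyError(f"{member!r} not found in {qza_path!r}")
```

Something like `sys.stdout.write(read_qza_text("taxonomy.qza", "data/taxonomy.tsv"))` (with a hypothetical file name) could then be piped into the usual unix tools.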

Any thoughts on these ideas would be greatly appreciated.

SoilRotifer · November 3, 2020, 5:33pm

Hi @nick-youngblut, given your interest in:

We have a good first issue for starting to add GTDB-related functionality to RESCRIPt.

-Cheers!
-Mike

nick-youngblut · November 4, 2020, 7:36am

Thanks Mike for the heads up! I'll look into contributing.

thermokarst · November 5, 2020, 3:38pm

Neat! This sounds like a limited QIIME 2 interface, and would be pretty cool to see. Keep us posted and let us know if you want to chat more about it.

:qiime2: