Diversity statistical testing export

Hi all,

I've written some code to interrogate the visualizations from q2-diversity statistical testing using the Artifact API. It exports a CSV containing the group-significance or correlation results for every metadata column, so the user doesn't have to manually step through every column in the visualization or type out the beta-correlation testing commands one by one. I had an issue that @thermokarst helped me solve on this post, and when I asked whether this could be contributed to the plugin, he suggested I post over here to discuss it. I've been working to learn more about the QIIME 2 framework (reading through the developer documentation and source code) with the intent of contributing, but I haven't seen many good first issues raised recently that are within my skill set, so I thought this might be a good first project to work toward.
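To give a concrete picture of the concatenation step described above, here is a minimal sketch (not the actual submitted code) of stacking one exported per-column result CSV into a single table with a `column` field identifying the source metadata column. The function name and the CSV headers are hypothetical.

```python
# Hypothetical sketch: merge per-column group-significance result CSVs
# into one combined CSV, tagging each row with its metadata column name.
import csv
import io

def concatenate_results(results_by_column):
    """results_by_column: dict mapping metadata column name -> CSV text
    of that column's results. Returns the combined CSV as text."""
    out = io.StringIO()
    writer = None
    for column_name, csv_text in results_by_column.items():
        reader = csv.DictReader(io.StringIO(csv_text))
        for row in reader:
            if writer is None:
                # Prepend a "column" field to the shared header.
                writer = csv.DictWriter(
                    out, fieldnames=['column'] + reader.fieldnames)
                writer.writeheader()
            writer.writerow({'column': column_name, **row})
    return out.getvalue()
```

In the real workflow the input text would come from the files exported out of each visualization (e.g. the pairwise PERMANOVA CSVs), before the temporary files are deleted.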

Here are some of my main questions:

  1. Is there any interest in this function? If so, how is a plan typically developed for contributing?

  2. Right now, the process includes exporting, parsing, and deleting the visualization files. Would it be better to focus on an artifact-based approach that doesn't involve creating temporary files?

  3. Are there concerns about potential abuse of multiple hypothesis testing with this? Would the results need to be corrected for false discovery rate?
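On question 3, one standard option would be a Benjamini-Hochberg FDR correction over the per-column p-values. A from-scratch sketch for illustration (in practice something like statsmodels' `multipletests` with `method='fdr_bh'` would be used):

```python
# Benjamini-Hochberg adjustment, implemented directly for illustration.
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        value = min(prev, pvalues[i] * m / rank)
        adjusted[i] = value
        prev = value
    return adjusted
```

For example, four raw p-values of 0.01, 0.04, 0.03, and 0.005 adjust to 0.02, 0.04, 0.04, and 0.02.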

Thanks!


Hey there @sterrettJD, sorry for the lack of activity here. I am planning on responding in more detail soon. Thanks!

:qiime2:

Hey there @sterrettJD! Sorry for the slow reply, things have been busy here at Q2HQ!

I think there is a lot of value in being able to compute beta-group-significance for multiple metadata columns at once!

We usually discuss big(ish) changes like this by preparing something like an informal RFC - this forum thread would be a great place to plan the work (there are a few examples of this floating around here in the "Developer Discussion" category).

Exporting data breaks the provenance for this process, so I think it would be best to focus on alternative approaches. I think an Artifact-based approach could work, although it might require the creation of another Method+Type+Format, which isn't necessarily an issue, but might become a pretty large project.

I am wondering if we can take the approach of updating the existing visualizer to accept multiple metadata columns, to compute everything in one shot.

That is a great question! I am not sure - I don't have a background in statistics. @jwdebelius or @Nicholas_Bokulich, any thoughts here?

Thanks so much @sterrettJD, looking forward to continuing this discussion!


I think this is a really good idea @sterrettJD, and I'd love to see this! And thanks @thermokarst for tagging me in!

I think it might be useful to separate out the computational step from the visualization step here, to allow for more concatenation, but that may not be practical. You could then calculate the artefacts, pass in a list, and then propagate with provenance? It's something I've been wanting on adonis for a while (also adonis2, but that's probably a separate discussion and maybe something for me to look at later).

But I think I got tagged in to play a statistician :movie_camera:...

I think there are always multiple hypothesis concerns. My major concern would be cases in which p-values are between 0.05 and 1/(number of permutations), because that's where you really need FDR. If p = 1/(number of permutations + 1), then we can't find anything more extreme, and while you could penalize, you're well past the limits of random chance. However, when you've got a p-value that indicates some permutations were significant, that's when you might want to worry.

When I do this for myself, I tend to do a first pass of whole covariates uncorrected, and then I do a second round with either pairwise testing or adonis. I'm not sure the average user will know to do that. So one option would be to only allow testing of a variable without post-hoc testing to limit the space, and then do a more permissive correction here (I think there are permutation-specific corrections, but I'm not sure). If people want to dig into specific groups, they could do the standard beta-group-significance with post-hoc testing. It also potentially makes this more directed.
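The permutation floor mentioned above is easy to state in code. This sketch (the function names are ours, not part of q2-diversity) computes the smallest p-value a permutation test can report, and flags the values between that floor and 0.05 where FDR correction matters most:

```python
# With n permutations, the smallest attainable p-value is 1 / (n + 1):
# the observed statistic itself counts as one "permutation".
def permutation_floor(n_permutations):
    return 1.0 / (n_permutations + 1)

def needs_fdr_attention(pvalue, n_permutations, alpha=0.05):
    """True when pvalue is nominally significant but above the smallest
    value the permutation test could ever report, i.e. the zone where
    FDR correction is most important."""
    return permutation_floor(n_permutations) < pvalue <= alpha
```

With q2-diversity's default of 999 permutations, the floor is 0.001, so a reported p = 0.001 is at the limit of the test, while p = 0.02 sits in the zone where correction is warranted.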

Best,
Justine


Hi @thermokarst and @jwdebelius, thank you both for the input! How does this sound?

Combining both of these:

Using beta group significance as an example, would it be possible to first update the beta_group_significance() function to accept a qiime2.Metadata type, filter all categorical columns, and loop through each column to run the rest of the function?

Based on what's already done in the alpha_group_significance function, files for the results of individual columns would be created, and the visualizer template could be updated to have a column-selection dropdown menu like the one seen in the alpha group significance viz. I haven't ever worked in HTML or JavaScript (though I'd love to learn through this), but I imagine this would require a load_data function similar to the following from the alpha group sig template:

  var d = [];
  function load_data(columnName, data, filtered, kwAll, kwPairwise, kwCSVPath, metricName) {
    d.push({
      column: columnName,
      data: data,
      kwAll: kwAll,
      kwPairwise: kwPairwise,
      filtered: filtered,
      kwCSVPath: kwCSVPath,
      metricName: metricName,
    });
  } 
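On the Python side, the looping idea could look roughly like this. This is a hypothetical outline, not the actual q2-diversity implementation; `run_single_column_test` stands in for the body of the current per-column visualizer, and the column-type strings mirror (but do not use) qiime2.Metadata's categorical/numeric distinction:

```python
# Hypothetical sketch: filter a metadata table down to its categorical
# columns and run the existing per-column test on each one.
def group_significance_all_columns(metadata_columns, run_single_column_test):
    """metadata_columns: dict of column name -> column type string.
    Returns a dict of per-column results."""
    results = {}
    for name, column_type in metadata_columns.items():
        if column_type != 'categorical':
            continue  # beta-group-significance only handles categorical data
        results[name] = run_single_column_test(name)
    return results
```

With the real API, the filtering step would presumably use something like `qiime2.Metadata.filter_columns(column_type='categorical')` rather than a plain dict.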

Additionally, to address the initial goal here of exporting all of the results into one TSV, a "Download results as a TSV" button could be added to all diversity visualization templates, which would require a function to concatenate all of the TSV files used for the visualization. Would this all-statistics-results.tsv file need to be created before that button is pressed, or could it be created after it is pressed? Alternatively, would it be best to just export it as an output of the beta-group-significance function itself?

I'm a bit confused by this, which I think is primarily a result of my inexperience with post-hoc testing. I'm going to read up some more on adonis and other post-hoc tests, but could you clarify what you mean by "to limit the space" and where exactly "here" is? Are you suggesting that in the proposed function for testing all columns, there should be no --p-pairwise option? If so, that sounds like a good plan to me!

I like this idea as well! I know this is not your point here, but as far as structure is concerned, I was wondering if this means that the proposed function for testing all columns should be a new function, such as beta-group-significance-all? Or maybe the proposed behavior could be run if no column is specified (i.e., if column is None: test_all_columns())?
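The no-column-specified dispatch could be sketched like this (hypothetical signature; the real beta_group_significance takes several more arguments, and the column-type strings are stand-ins for qiime2.Metadata's own typing):

```python
# Hypothetical dispatch: an optional `column` argument selects one
# column, while omitting it tests every categorical column.
def select_target_columns(metadata_columns, column=None):
    """metadata_columns: dict of column name -> column type string.
    Returns the list of column names that would be tested."""
    if column is None:
        return [name for name, column_type in metadata_columns.items()
                if column_type == 'categorical']
    return [column]
```

This keeps the existing single-column behavior backwards-compatible while making the all-columns path the default when nothing is specified.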

I really appreciate the feedback, so thank you both!

Hi @sterrettJD,

sorry for my lack of clarity!

Yes! I think that you should avoid pairwise testing.

I'm not sure if a new command will confuse people more or not. @thermokarst?

Best,
Justine


I don't think we want to get into the business of operating across all Metadata columns - that can potentially be huge and quite expensive, computationally. A more flexible approach, which would also allow for backwards compatibility, would be to modify the existing beta-group-significance visualizer to accept a Set of metadata columns. We aren't currently able to register Set[MetadataColumn[Categorical]], pinging @ebolyen for a reminder on what might be blocking that (if anything - it might just be a case of not implementing it yet). A while back @ebolyen and I had a long discussion about a modified metadata access syntax, which might have some implications here as well, but I can't recall the specifics. :brain: :fog:


That makes perfect sense to me.

Any news/ideas on what the issues with registering Set[MetadataColumn[Categorical]] might be? If it's just a case of it not being implemented yet, I'd be happy to look into working on that. My semester is wrapping up over the next week, and I think I'll have some time over the next month to dive into it a bit.

(Apologies for the delayed response, I was waiting to see if there were any updates on the Set[MetadataColumn[Categorical]] question.)


Hi @sterrettJD - @ebolyen was out of the office for a while, but hopefully this is back on his radar. More soon.