Q2-cscs community plugin for metabolomics data

madeleineernst · August 21, 2018, 2:02pm

Hi QIIME2 community,

Asker Brejnrod and I have a first version of the “q2-cscs” plugin ready. It is available on GitHub (madeleineernst/q2-cscs: An implementation of the chemical structural and compositional dissimilarity metric in qiime2) and is also conda-installable from Anaconda.org (Q2 Cscs).

The plugin computes the chemical structural and compositional dissimilarity metric for MS/MS metabolomics samples. A short motivation section and an example tutorial are provided on GitHub.
We are looking forward to your comments!

Kind regards,
Madeleine (mernst@ucsd.edu) and Asker (brejnrod@sund.ku.dk)

gregcaporaso · August 21, 2018, 4:43pm

Hi @madeleineernst,
Thanks so much for this contribution - it is great to see that this plugin is ready to use! I think this is our first metabolomics plugin as well, which is extremely exciting!

I have a few comments on your tutorial and the plugin that I think might make them even more useful.

First, I tested with QIIME 2 2018.6 and it seems to work. Your tutorial mentions activating 2018.4 - you should be able to update that to 2018.6.

I was a little confused about which files I should be using after downloading the ProteoSAFe-METABOLOMICS-SNETS-5729dd0f-download_cluster_buckettable.zip file. It would be helpful to note that the two files used in the tutorial will be in that zip file (I had originally thought that I was looking for three files to download based on the three numbered items in your screenshots). You could then include the commands:

unzip ProteoSAFe-METABOLOMICS-SNETS-5729dd0f-download_cluster_buckettable.zip
cp ProteoSAFe-METABOLOMICS-SNETS-5729dd0f-download_cluster_buckettable/METABOLOMICS-SNETS-5729dd0f-download_cluster_buckettable-main.tsv ./GNPS_buckettable.tsv
cp ProteoSAFe-METABOLOMICS-SNETS-5729dd0f-download_cluster_buckettable/networkedges_selfloop/c8a76183cbe644a194408b514ba51632.pairsinfo GNPS_edges.tsv

so that the files you use in your commands have the same names as files the user has in their current directory.

Is it possible to provide a command for the user to download the zip file (e.g., using curl or wget)? If so, we'll ultimately be able to automatically test this tutorial for you with new releases of QIIME 2 so that you can be alerted if something breaks.

I notice that you currently import the biom file with a FeatureTable[Frequency] semantic type. The Frequency part of that implies that the values in this table are counts, and that this could therefore be used with other QIIME 2 actions that require counts (such as qiime diversity rarefy, which will subsample the counts in the table without replacement to a user-specified total frequency per sample). If this isn't the right type, could you describe what these values are, and we can chat about whether a more appropriate type exists or should be created? This will help us prevent users from mistakenly misusing their data (e.g., by trying to rarefy this table if that's not appropriate). I haven't worked a lot with metabolome data, so maybe these are actually counts, in which case you should just ignore this comment.
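To make the concern concrete, here is a minimal Python sketch of subsampling without replacement, the operation described above. It is not the q2-diversity implementation; the function name `rarefy_counts` and the toy values are hypothetical. The point is that the operation only makes sense for non-negative integer counts, which is what the Frequency part of the type implies.

```python
import random


def rarefy_counts(counts, depth, seed=None):
    """Subsample one sample's feature counts without replacement to `depth`.

    Hypothetical helper for illustration only.
    """
    if any((not isinstance(c, int)) or c < 0 for c in counts.values()):
        raise ValueError("rarefaction needs non-negative integer counts")
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"depth {depth} exceeds sample total {total}")
    # Expand to one entry per observation, then draw `depth` of them.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    drawn = rng.sample(pool, depth)
    rarefied = {feature: 0 for feature in counts}
    for feature in drawn:
        rarefied[feature] += 1
    return rarefied


# Toy sample with made-up counts.
sample = {"feature_a": 12, "feature_b": 5, "feature_c": 3}
rarefied = rarefy_counts(sample, depth=10, seed=42)
print(rarefied, sum(rarefied.values()))
```

A table of ion intensities (positive real numbers) would be rejected by the first check, which is the kind of misuse a more specific semantic type could prevent up front.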

At some point, it would be worth seeing if there would be a better way to pass the --p-css-edges file. Because it's specified as a parameter, in a graphical interface the user wouldn't get a file selection box, so it might be hard or impossible to pass this path as a parameter through a GUI. Making it feature metadata instead should work, but I realize this is pairwise data so it doesn't exactly match with that concept. That's probably something we need to think about supporting at the framework level.

Is there a test data set that you could use that would require less computation time? The qiime cscs cscs step took over an hour to run for me. It's helpful for testing, and for using these tutorials in workshops, if the test analysis can run quickly (a minute or two at most). That's of course not always possible though.

I realize this is a lot of info, but just think of this as some general feedback - not a list of urgent to-do items. This is very exciting and useful to have as-is! Thanks for your interest in contributing to QIIME 2!

gregcaporaso · August 21, 2018, 5:46pm

@madeleineernst, I ended up poking through your GitHub repository a little more and have a couple of follow-ups.

I noticed that you have all of the example files in your GitHub repo. I would recommend including the following commands toward the top of your tutorial, maybe in a section titled Obtaining tutorial data.

wget https://raw.githubusercontent.com/madeleineernst/q2-cscs/master/Example/GNPS_buckettable.tsv
wget https://raw.githubusercontent.com/madeleineernst/q2-cscs/master/Example/GNPS_edges.tsv
wget https://raw.githubusercontent.com/madeleineernst/q2-cscs/master/Example/MappingFile_UrineSamples.txt

Since there is only one output from your call to qiime cscs cscs, I recommend that your example write a file, rather than a directory containing a single file. You could change this to the following:

qiime cscs cscs --p-css-edges GNPS_edges.tsv --i-features GNPS_buckettable.qza --p-cosine-threshold 0.5 --p-normalization --o-distance-matrix cscs_distance_matrix.qza

Some users will likely be interested in Mantel tests and/or Procrustes analysis comparing the results of the two distance computations. You could add these commands in at the end if you'd like to help users with that.

qiime diversity mantel --i-dm1 cscs_distance_matrix.qza --i-dm2 braycurtis_GNPS_buckettable.qza --o-visualization mantel.qzv
qiime diversity procrustes-analysis --i-reference braycurtis_PCoA.qza --i-other cscs_PCoA.qza --output-dir ./procrustes-out
qiime emperor procrustes-plot --i-reference-pcoa procrustes-out/transformed_reference.qza --i-other-pcoa procrustes-out/transformed_other.qza --m-metadata-file MappingFile_UrineSamples.txt --o-visualization procrustes-out/plot.qzv
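For readers unfamiliar with the Mantel test used in the first command above, the following is a rough pure-Python sketch of the idea: correlate the off-diagonal entries of two distance matrices, then permute the sample order of one matrix to estimate a p-value. It is not the code behind `qiime diversity mantel`; it uses Pearson correlation for brevity, and the function names and toy matrices are made up for illustration.

```python
import math
import random


def upper_triangle(matrix):
    """Return the entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel(dm1, dm2, permutations=999, seed=0):
    """Two-sided permutation Mantel test on two square distance matrices."""
    rng = random.Random(seed)
    n = len(dm1)
    flat2 = upper_triangle(dm2)
    observed = pearson(upper_triangle(dm1), flat2)
    hits = 0
    for _ in range(permutations):
        # Shuffle sample order; rows and columns must move together.
        order = list(range(n))
        rng.shuffle(order)
        permuted = [[dm1[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if abs(pearson(upper_triangle(permuted), flat2)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy matrices: dm_b is dm_a scaled by 2, so they are perfectly correlated.
dm_a = [
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
dm_b = [[2 * value for value in row] for row in dm_a]
r, p_value = mantel(dm_a, dm_b)
print(r, p_value)
```

With only four samples there are very few distinct permutations, so the p-value here is not meaningful; the sketch is only meant to show what the statistic compares.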

Ok, that's all! Hope you don't mind all of the feedback, I got excited about playing with this this morning.

madeleineernst · August 22, 2018, 12:41pm

Hi Greg,

Thanks for your enthusiastic and thorough reply. We are happy to hear that other people can make it work, even if there is room for improvement. Several of your points are issues that we considered but dropped, either for lack of time or for lack of familiarity with the idiomatic qiime2 way of doing things. A few follow-up questions have also come up in light of your reply.

We will try to reply point by point to make the conversation clearer.

1. We have actually used 2018.6 throughout development; I don’t know how this sneaked in. Fixed!
2. We have had many discussions about automatically downloading the files; it is obviously a good idea. It is indeed possible, and the functionality is implemented in our corresponding R package using curl (GitHub - askerdb/rCSCS: R implementation of CSCS). As you know, there is a plugin in development dedicated to GNPS integration, and we felt this functionality would be more appropriate there. When that is done we can simply update the tutorial and hopefully avoid the confusion over the files. There are also some additional questions about types for the edges file, addressed in point 4.
3. We assume for now that the data is in the form of a GNPS “buckettable”, which consists of mass spec scan counts. The FeatureTable[Frequency] semantic type is also the output of the plugin dedicated to GNPS integration. It is possible to get useful data in the form of ion intensities, i.e. unbounded positive real numbers, but we have ignored that for now because it is not available from GNPS and there is no good semantic type for it. It is not obvious that it makes sense to do rarefaction on these counts, but that is a research topic in itself.
4. As you note, there is no obvious semantic type for the --p-css-edges file, so we chose this solution as it also avoids an extra conversion step in the manual download process. We were not aware of the implications for qiime2 GUIs, but with that in mind it would certainly make sense to address this. Additionally, it would be quite useful to have a SimilarityMatrix type (for the cosine similarity matrix that the edges file is converted to), but there is no such type in skbio, and developing one ourselves would be quite non-trivial.
5. We have some mock data that is small and fast, and we could put that at the front of the tutorial as a “quick start guide”.
6. We were not aware that you could output the distance matrix as just a file instead, but that is an obvious improvement!
7. We considered adding the Procrustes and Mantel tests, and we are still open to doing it if you believe there is interest. Ultimately, we decided that this would be more exciting with distance matrices from matching microbiomes and metabolomes, but that would require us to come up with another example, and the deadline put it on the “nice to have” list. From a paper-writing point of view it would of course underscore the omics integration possible in qiime2.
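The conversion of the edges file into a cosine similarity matrix mentioned above can be sketched roughly as follows. This is an illustration and not the q2-cscs code: the three-field edge layout (feature A, feature B, cosine score), the feature IDs, the function name and the way the threshold is applied are all assumptions, and a real GNPS edges file may carry additional columns.

```python
def edges_to_similarity(edges, features, cosine_threshold=0.5):
    """Build a symmetric similarity matrix from (feature_a, feature_b, cosine) edges.

    Hypothetical helper: edges below the threshold, or touching features
    that are not in the table, are ignored.
    """
    index = {feature: i for i, feature in enumerate(features)}
    n = len(features)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # every feature is identical to itself
    for a, b, cosine in edges:
        if a not in index or b not in index:
            continue
        if cosine < cosine_threshold:
            continue
        i, j = index[a], index[b]
        # Pairwise data is undirected, so fill both halves.
        sim[i][j] = cosine
        sim[j][i] = cosine
    return sim


# Made-up feature IDs and scores.
features = ["1", "2", "3"]
edges = [("1", "2", 0.8), ("2", "3", 0.3)]
similarity = edges_to_similarity(edges, features)
for row in similarity:
    print(row)
```

A dedicated SimilarityMatrix type would essentially wrap a structure like this one, together with the feature IDs that label its rows and columns.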

Thanks for your help with this!

Regards,

Madeleine and Asker

gregcaporaso · August 23, 2018, 5:37pm

Hi @madeleineernst, Thanks for the reply!

1. Great!
2. Sounds good! We mostly care about this because it will help us to automate the execution of the documentation, which serves as important integration tests for us (and ensures that users don't run into broken commands when they're working through our tutorials, which we've all experienced before and know is super-frustrating).
3. Ok, sounds good. I know way less about this than your group does, so I defer to you on this. If you think that new semantic types make sense, let us know and we'll be happy to chat about these with you. They are easy to define, and can be defined in the plugins themselves, so it's not a big deal if you decide that you do want them.
4. Would you be willing to open an issue to improve this on the q2-cscs issue tracker? If you link me to that issue, I can chat about it with the core developers and maybe we can get someone to issue a PR to your plugin that would improve this (e.g., by creating a SimilarityMatrix semantic type, ...).
5. That sounds good.
6. Great!
7. I think it could be interesting to do this here to compare the two diversity metrics (but my commands are already in this discussion thread, so maybe that's good enough - interested users could find them). Doing this for microbiome and metabolome data from the same samples would definitely be a nice-to-have example - I agree that feels like a longer term thing. That could be part of a multi-omics overview tutorial when we get there.

Thanks again for all of the work on this!

madeleineernst · August 27, 2018, 3:15pm

Hi Greg,

We have now addressed points 1, 5 and 6 by updating the documentation in the GitHub repository. Some comments on the remaining points:

2. Besides the small unit test dataset, which can be run in an automated manner, we also provide an option to download the real-world dataset automatically using curl. Longer term, we would like to integrate the automated download through the plugin dedicated to GNPS integration, which is currently under development.

3. We think that the FeatureTable[Frequency] semantic type is fine for now, as it matches the output semantic type of the plugin dedicated to GNPS integration.

4. We have opened an issue on the q2-cscs issue tracker regarding this subject: Create SimilarityMatrix semantic type · Issue #1 · madeleineernst/q2-cscs · GitHub

7. For now we have added the commands you suggested for comparing the two diversity metrics with a Mantel test and Procrustes analysis to the documentation.

Thanks a lot again for your suggestions for improvement! We hope that we were able to address some of them and will continue working on the longer term suggestions once plugins/data become available!

Kind regards,

Madeleine and Asker

askerdb · August 27, 2018, 7:41pm

Hi Greg,

Additionally, I would like to ask about unit tests.
We have been trying to come up with useful tests that cover the fact that real data is quite a bit more complicated than the mock set, but it is not quite clear how to make the most meaningful tests of this.

Could you share some of your experience with this? I assume the situation is quite similar for UniFrac. Then I will go ahead and implement something.

Asker

gregcaporaso · August 31, 2018, 3:11pm

Hi @askerdb,
The usual approach is to have a very simple data set that you can use for testing of edge cases and basic functionality. It's important this is small enough that you can compute the expected results manually. These tests should be very extensive, covering both valid and invalid input, ensuring that you achieve the minimum and maximum values of the metric when it's appropriate, ensuring that appropriate and helpful error messages are raised when necessary, ... (If you want some more input on what to test, check out How to Break Software - though there are likely more modern books on the topic.)
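As an illustration of this first kind of test, here is a minimal sketch using Python's unittest. Bray-Curtis stands in for the metric under test only because its values are easy to work out by hand; the function and test names are hypothetical, and none of this is taken from the q2-cscs or UniFrac test suites.

```python
import unittest


def bray_curtis(u, v):
    """Toy stand-in metric: Bray-Curtis dissimilarity of two count vectors."""
    if len(u) != len(v):
        raise ValueError("samples must have the same number of features")
    if any(x < 0 for x in u) or any(x < 0 for x in v):
        raise ValueError("counts must be non-negative")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("at least one sample must have non-zero counts")
    return sum(abs(x - y) for x, y in zip(u, v)) / total


class BrayCurtisTests(unittest.TestCase):
    def test_hand_computed_value(self):
        # |1-3| + |2-2| + |3-1| = 4; totals 6 + 6 = 12; 4 / 12 = 1/3
        self.assertAlmostEqual(bray_curtis([1, 2, 3], [3, 2, 1]), 1 / 3)

    def test_minimum_for_identical_samples(self):
        self.assertEqual(bray_curtis([4, 0, 2], [4, 0, 2]), 0.0)

    def test_maximum_for_disjoint_samples(self):
        # No shared features, so the metric reaches its upper bound.
        self.assertEqual(bray_curtis([5, 0], [0, 7]), 1.0)

    def test_symmetry(self):
        self.assertEqual(bray_curtis([1, 4], [3, 2]), bray_curtis([3, 2], [1, 4]))

    def test_invalid_input_raises_helpful_error(self):
        with self.assertRaisesRegex(ValueError, "same number of features"):
            bray_curtis([1, 2], [1, 2, 3])
        with self.assertRaisesRegex(ValueError, "non-negative"):
            bray_curtis([1, -2], [1, 2])


# Run the suite programmatically so the script keeps going afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BrayCurtisTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test checks one property (a hand-computed value, the minimum, the maximum, symmetry, error handling), so a failure points directly at what broke.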

Then, you would usually have one or two real-world data sets that you run a couple of tests on. More is better - it just depends on how time consuming it is to compute the expected results, which is the hard part. If there is a paper on the metric which contains data and presents results of computing the metric on that data, you can build your "real world" tests around that data (confirm that your implementation reproduces the results generated on real-world data in the paper). Alternatively, if there is a reference implementation of your metric that is well-tested (and ideally implemented by someone else), you could generate a test data set, run it through that implementation, and then write tests that confirm that your method gets those same results.
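A "real world" test of this kind usually boils down to comparing your output against stored reference values within a numerical tolerance. The sketch below assumes the reference results are kept as a labelled, tab-separated distance matrix; the helper names are hypothetical, and the numbers are placeholders rather than results from any paper or reference implementation.

```python
import csv
import io
import math


def read_distance_tsv(text):
    """Parse a labelled, square distance matrix from tab-separated text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    ids = rows[0][1:]  # header row: empty corner cell, then sample IDs
    matrix = {}
    for row in rows[1:]:
        matrix[row[0]] = dict(zip(ids, map(float, row[1:])))
    return ids, matrix


def assert_matches_reference(computed, reference, ids,
                             rel_tol=1e-9, abs_tol=1e-12):
    """Raise AssertionError on the first entry that differs from the reference."""
    for a in ids:
        for b in ids:
            if not math.isclose(computed[a][b], reference[a][b],
                                rel_tol=rel_tol, abs_tol=abs_tol):
                raise AssertionError(
                    f"{a} vs {b}: got {computed[a][b]}, "
                    f"expected {reference[a][b]}")


# Placeholder reference values; in practice these would come from the
# paper or from a trusted reference implementation.
REFERENCE_TSV = (
    "\ts1\ts2\ts3\n"
    "s1\t0.0\t0.25\t0.5\n"
    "s2\t0.25\t0.0\t0.75\n"
    "s3\t0.5\t0.75\t0.0\n"
)
ids, reference = read_distance_tsv(REFERENCE_TSV)

# Stand-in for the output of the implementation being tested.
computed = {
    "s1": {"s1": 0.0, "s2": 0.25, "s3": 0.5},
    "s2": {"s1": 0.25, "s2": 0.0, "s3": 0.75},
    "s3": {"s1": 0.5, "s2": 0.75, "s3": 0.0},
}
assert_matches_reference(computed, reference, ids)
print("computed matrix matches the reference")
```

Comparing with a tolerance rather than exact equality avoids spurious failures from floating-point differences between implementations.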

This is a complicated process, and it's important to get right. If done right, writing unit tests typically takes longer than writing the code that is being tested. It's worth the time investment though, and you'll get quicker with more experience. When a user inevitably gets in touch with you and tells you they think they've found a bug in your software (which will happen after your group and multiple other groups have published high-impact findings based on your code), you'll be thankful for all the time you spent writing tests. It of course may still be a legitimate bug, but if you have an extensive unit test suite then there's a good chance it's not actually a bug but rather a misunderstanding of the method by the user. And, if it is a bug, at least you'll know that you did your due diligence in trying to avoid the issue up front.

You can find the UniFrac test suite here in case that is a useful reference.

Hope this helps!