Hi, @marctormo,
Important: after digging into this further, I discovered that the tutorial which I quoted above is incorrect in stating that the averages are plotted. In fact, the curves are drawn based on the median.. This fact is correctly documented in the visualization help text:
I think were you may be going wrong is that you need to group by metadata columns first. To be clear, all iterations at a specified depth are taken into account when plotting the curves.
Rather than trying to reproduce the plot by mimicking QIIME 2's operations on the raw, rarefaction results in metric*.csv
, you could export the visualization and read in/plot the raw data in the metric*.jsonp
files, e.g.:
➜ qiime tools export --input-path alpha_rarefaction_curves.qzv --output-path alpha_rarefaction_curves
Exported alpha_rarefaction_curves.qzv as Visualization to directory alpha_rarefaction_curves
[I] ➜ ls alpha_rarefaction_curves
dist observed_features-donor_status.jsonp q2templateassets shannon-genotype.jsonp
index.html observed_features-genotype.jsonp shannon-barcode.jsonp shannon-genotype_and_donor_status.jsonp
observed_features-barcode.jsonp observed_features-genotype_and_donor_status.jsonp shannon-cage_id.jsonp shannon-mouse_id.jsonp
observed_features-cage_id.jsonp observed_features-mouse_id.jsonp shannon-donor.jsonp shannon.csv
observed_features-donor.jsonp observed_features.csv shannon-donor_status.jsonp
That being said, I will try to put the basic, high-level steps of our Python implementation for generating the data used to plot the curves when grouping on a categorical metadata column into plain language. The "real," answer is, of course, in the source code, here.
You must do the following for each metadata column:
- Reindex the data frame on the metadata column.
- Sort the index by rows.
- Group by the metadata column.
- Calculate the median on the grouped data frame.
To illustrate, prior to the first step, your data frame would look something like this:
_alpha_rarefaction_depth_column_ 1 ... 200 pet
iter 1 2 3 ... 9 10
S1 -0.0 -0.0 -0.0 ... 0.998846 0.998846 russ
S2 -0.0 -0.0 -0.0 ... 0.998196 0.999351 milo
S3 -0.0 -0.0 -0.0 ... 1.000000 0.999928 peanut
After the four steps above, you would be working with something like this:
_alpha_rarefaction_depth_column_ 1 ... 200
iter 1 2 3 ... 8 9 10
pet ...
milo -0.0 -0.0 -0.0 ... 0.999928 0.998196 0.999351
peanut -0.0 -0.0 -0.0 ... 0.999711 1.000000 0.999928
russ -0.0 -0.0 -0.0 ... 0.997402 0.998846 0.998846
- The next step is to stack the data frame from columns to index, which would look like this:
iter 1 2 ... 9 10
pet _alpha_rarefaction_depth_column_ ...
milo 1 -0.000000 -0.000000 ... -0.000000 -0.000000
23 0.886541 0.998636 ... 0.932112 0.965636
45 0.999644 0.982474 ... 0.996792 0.982474
67 0.992112 0.999839 ... 0.999839 0.999839
89 0.995533 0.995533 ... 0.992611 0.995533
111 0.999473 0.999473 ... 0.999941 0.999473
133 0.999959 0.999633 ... 0.999959 0.999959
155 0.998528 0.998528 ... 0.999730 0.999970
177 0.999424 0.999793 ... 0.999977 0.999977
200 0.999351 1.000000 ... 0.998196 0.999351
peanut 1 -0.000000 -0.000000 ... -0.000000 -0.000000
23 0.987693 0.987693 ... 0.998636 0.998636
45 0.991076 0.999644 ... 0.999644 0.982474
67 0.972670 0.980468 ... 0.999839 0.999839
89 0.997722 0.995533 ... 0.997722 0.984554
111 0.995253 0.998536 ... 0.995253 0.998536
133 0.999633 0.999959 ... 0.995060 0.999959
155 0.998528 0.999970 ... 0.997567 0.997567
177 0.999793 0.996105 ... 0.999793 0.999977
200 1.000000 0.999351 ... 1.000000 0.999928
russ 1 -0.000000 -0.000000 ... -0.000000 -0.000000
23 0.965636 0.987693 ... 0.886541 0.987693
45 0.999644 0.999644 ... 0.999644 0.996792
67 0.992112 0.998553 ... 0.999839 0.995979
89 0.999180 0.999909 ... 0.984554 0.979412
111 0.992904 0.995253 ... 0.992904 0.998536
133 0.990805 0.999633 ... 0.999633 0.995060
155 0.998528 0.986718 ... 0.999970 0.999730
177 0.997212 0.994813 ... 0.999793 0.996105
200 0.999351 0.998846 ... 0.998846 0.998846
It is on this stacked version of the data frame that we calculate the five-number summaries, used to plot the data for a given metadata column.