I'm running benchmark tests on several UniFrac implementations as part of a manuscript I'm writing, and wanted to share my results for QIIME 2 to confirm with its developers that I'm seeing what they'd expect. If anything looks amiss please let me know. Thanks!
I'm using qiime2-2018.6 on a Xeon E5-2670 v2 2.50 GHz machine with 256 GB RAM and 20 physical cores running Red Hat natively. All I/O is being done to/from a RAM drive. QIIME 2 is being invoked using the following script, the entirety of which is timed:
where $1 is the number of threads, $2 is "u" or "w", and $3 is the dataset size (number of samples, all pre-selected from the HMP 16S dataset and pre-rarefied to 1000 reads/sample).
If you have any suggestions for modifying the above script in order to decrease the wall clock time required to go from a biom and newick file to a tsv matrix of UniFrac distances, I'd be more than happy to implement them.
Thanks again for all your team's great work on this software!
Those curves differ from what I’d expect with Striped UniFrac (beta-phylogenetic-alt). I’m particularly concerned about the lack of a difference in runtime with multiple threads for weighted UniFrac. Can you confirm that, when run, there actually is increased processor utilization with increasing numbers of threads? If not, there may be a bug.
As a sanity check, you can use the Striped UniFrac algorithm directly using the binary ssu that’s in your environment if you want to execute independent of QIIME 2 for testing. It’s also possible to execute directly via Python (e.g., import unifrac; unifrac.unweighted(...)).
A possibly important point: the library in conda is pre-compiled and generic. The library relies on compile-time auto-vectorization, which can be compiler- and architecture-specific. In other words, the compiled shared library being indirectly used may not be optimized for your architecture.
As a point of context, I can compute unweighted UniFrac on the Earth Microbiome Project on my laptop using four cores in less than a day when running directly against the binary (that was on a mid-2016 MacBook Pro). That dataset is 27k samples and the phylogeny has around 320k tips. We also see near-linear parallel scaling out to 16 cores if hyperthreading is respected, though we have observed some potential confounding with non-uniform memory access (NUMA) boundaries.
@colinbrislawn, EMDUniFrac is related to Striped UniFrac in the postorder reduction performed, but they differ substantially in how samples are compared, and in how they can be parallelized. We are working with Prof. Koslicki on a manuscript at the moment on the new algorithm. Striped UniFrac can be found here, which is the library behind beta-phylogenetic-alt.
Yikes, that was an oversight on my part. Since beta-diversity didn’t support parallelized computation of weighted UniFrac, I’d assumed beta-diversity-alt didn’t either. I just tested it though and it definitely does. I’ll re-run those and post an update early next week with the new numbers.
I was surprised also that unweighted UniFrac didn’t show better scaling with additional cores. I’ll give it a run on AWS using the QIIME 2 AMI to see if it’s just my particular environment it didn’t like.
@colinbrislawn, thanks for pointing me to EMDUniFrac! It wasn’t on my radar before but I’ll be including it now.
To clarify, the principal improvement from EMDUniFrac is reflected in Striped UniFrac.
The parallel model of Striped UniFrac is metric agnostic as it is based on how the distance matrix is computed, not the specific metric. In brief, diagonals of the distance matrix are computed in such a way that the diagonals are independent, readily vectorizable, represent an identical amount of compute, and only compute a few redundant values when the number of samples is even.
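To illustrate the idea, here is my own hypothetical sketch of that pairing scheme (the function name and shape are mine, not the library's code): stripe k pairs each sample i with sample (i + k) mod n, so each stripe is one diagonal of the matrix.

```python
def stripes(n):
    """Yield, for each stripe k, the sample pairs (i, (i + k) % n).

    Each stripe touches every sample exactly once and holds the same
    number of comparisons, so stripes are independent, uniform units
    of work that parallelize and vectorize naturally.
    """
    for k in range(1, n // 2 + 1):
        yield [(i, (i + k) % n) for i in range(n)]

# With an odd number of samples every unordered pair occurs exactly once;
# with an even number, the final stripe visits each of its pairs twice --
# the "few redundant values" mentioned above.
```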
My baseline single threaded testing between Striped UniFrac and QIIME 2’s Fast UniFrac (which uses scikit-bio’s implementation) is below. These values are from when we were designing the algorithm, so are dated and are not necessarily reflective of the present implementation. The tables were random samples from the American Gut, closed reference 97% against Greengenes.
For transparency, I wrote both the scikit-bio and Striped UniFrac implementations. The scikit-bio Fast UniFrac implementation is compiled like Striped UniFrac, and to the best of my knowledge is the fastest Fast UniFrac implementation available, irrespective of package or language.
edit: these values are running against scikit-bio and Striped UniFrac directly, not through QIIME 2.
You were right - compared to beta-phylogenetic-alt, the ssu binary is about 4x faster!
I tried running beta-phylogenetic-alt on EC2 using the latest QIIME 2 AMI, but it seems beta-phylogenetic-alt parallel processing is still broken on EC2, as detailed in GitHub issue #211.
I appended the updated results from my local machine below. Do these numbers seem about right?
Do you have numbers on the performance difference you observe between beta-phylogenetic-alt and the binary directly? Both ultimately rely on the exact same library code. I’d anticipate some overhead with the QIIME 2 plugin / artifact machinery but I don’t have an intuition.
The benchmarks are still lower than I’d expect. I do wonder whether the general build in conda is just not nearly as optimized as a recompilation. The parallel scaling may be bottoming out and encountering constant overhead? We observed effectively linear scaling out to 16 cores on a single system, however we did need to seat the threads w.r.t. hyper threading boundaries using taskset.
I’m not sure what you mean by “numbers on the performance difference you observe between beta-phylogenetic-alt and the binary directly”. I included the measured runtimes in my last post (SSU = calling the binary directly; QIIME 2 = beta-phylogenetic-alt). Are you referring to a different metric?
In the biom file with 4400 samples, there are 1133 OTUs.
You can give the benchmark script a try on your system if you like:
wget https://www.dropbox.com/s/57a25p2536k0t8r/benchmark.zip?dl=1
unzip benchmark.zip\?dl\=1
for i in {1..20}
do
    for w in w u
    do
        /usr/bin/time -f "qiime2 $i $w %e %M" /bin/sh qiime2.sh $i $w
        /usr/bin/time -f "ssu $i $w %e %M" /bin/sh ssu.sh $i $w
    done
done
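If it helps when tabulating results, the records that script emits can be parsed with a few lines of Python (a hypothetical helper for illustration, not part of the benchmark itself):

```python
def parse_timing(line):
    """Parse one record from `/usr/bin/time -f "NAME THREADS METRIC %e %M"`.

    %e is elapsed wall-clock time in seconds and %M is maximum resident
    set size in kilobytes.
    """
    impl, threads, metric, wall_s, maxrss_kb = line.split()
    return {"impl": impl, "threads": int(threads), "metric": metric,
            "wall_s": float(wall_s), "maxrss_kb": int(maxrss_kb)}
```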
Personally, I think your software does pretty well. I don’t doubt you can get it even faster by tweaking the operating system, but for this benchmark I need to use default configurations to have the closest approximation of how a normal user would run it.
Hi @Daniel_Smith, sorry if there was confusion there, I thought your prior results were comparing beta-phylogenetic-alt versus beta-phylogenetic. It does seem like there is some fluff we can trim with the Python interface, thank you for these!
The tree being used is a bit small. This may or may not be relevant for your needs/interests, but the theoretical compute for Striped UniFrac and Fast UniFrac is actually identical at O(n^2 * m), where n is the number of samples and m is the number of vertices in the tree. The real difference is in memory consumption, which is O(n^2 + n * m) for Fast UniFrac and roughly O(n^2 + n * log(m)) for Striped UniFrac. For larger trees, particularly fragment insertion ones, the difference will be appreciable.
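As a rough back-of-envelope illustration of why the memory models matter (the constants and 8-byte item size here are my own illustrative assumptions, not figures from the library):

```python
import math

def fast_unifrac_bytes(n, m, itemsize=8):
    # O(n^2 + n*m): the distance matrix plus per-sample state
    # spanning all tree vertices
    return (n * n + n * m) * itemsize

def striped_unifrac_bytes(n, m, itemsize=8):
    # roughly O(n^2 + n*log(m)): the distance matrix plus per-sample
    # state bounded by a logarithmic term in the tree size
    return (n * n + n * math.ceil(math.log2(m))) * itemsize
```

At EMP-like scale (n = 27,000 samples, m ≈ 320,000 vertices), the n*m term alone is on the order of 69 GB under these assumptions, while the n*log2(m) term is only a few megabytes; the shared n^2 distance-matrix term dominates Striped UniFrac instead.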
For clarification, I did not mean to suggest changes to your operating system, configurations, or anything which requires administrative level permissions. Generic binaries from bioconda have known performance issues for other parts of the QIIME 2 ecosystem (e.g., q2-dada). Unfortunately, we don’t have a good general workaround yet other than suggesting recompilation, if it is a concern. And it is plausible that any compute- or memory-bound threaded application, not just Striped UniFrac, would benefit from selection and seating of specific cores. This can be done without recompilation with the userland tool taskset. For example, taskset -c 0-31:2 qiime diversity beta-phylogenetic-alt ... would fix the process to cores 0, 2, 4, etc.
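A small hypothetical helper to build that taskset prefix for a given thread count (the helper is mine; it assumes hyperthread siblings are the odd-numbered logical cores, which varies by system, so check lscpu for your topology):

```python
def taskset_prefix(n_threads):
    """Build a `taskset -c` prefix pinning to cores 0, 2, 4, ...

    Stepping by two keeps one thread per physical core on systems
    where the odd-numbered logical cores are hyperthread siblings.
    Uses taskset's start-end:stride CPU list syntax.
    """
    last = 2 * (n_threads - 1)
    return f"taskset -c 0-{last}:2"

# taskset_prefix(16) -> "taskset -c 0-30:2", i.e. cores 0, 2, ..., 30
```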
You can get deep into the weeds on this stuff quickly though. Since you mentioned this was for a manuscript, it may be defensive to note that the stats are based on “out-of-the-box” use. Describing this use as “normal” may be unsubstantiated: we don’t track what users do in their environments so we don’t really know, and it’s not uncommon for system administrators to invest effort to optimize libraries for their hardware, particularly on shared resources.
On an aside, GNU Time 1.7 had a bug in which maximum resident memory was reported at 4x the real value. This was the default binary for CentOS and RHEL 6, so it was common to encounter in the field. May be worth a quick /usr/bin/time --version check.
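A hedged sketch of the correction, if rerunning isn't practical (the helper is mine; the 4x factor is specific to the 1.7 release on systems with 4 KB pages):

```python
def corrected_maxrss_kb(reported_kb, gnu_time_version):
    """Normalize GNU time's %M (max RSS, kilobytes) output.

    GNU time 1.7 converted ru_maxrss from pages to KB even though
    Linux already reports it in KB, inflating the value by the page
    size in KB (4 on typical x86 systems). Later releases are correct.
    """
    if gnu_time_version == "1.7":
        return reported_kb // 4
    return reported_kb
```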
My RHEL6 system is indeed affected by the GNU Time 1.7 bug you pointed out. I'm really glad you caught that! Graphs with the corrected memory usage are below.
I gave taskset a go and didn't observe any notable difference in runtime. That's an interesting trick though, and one I hadn't heard of before.
I'd really like to benchmark beta-phylogenetic-alt and ssu in the best light possible, and replicate the linear scaling you see. Are there instructions out there on how to recompile QIIME 2 binaries?
Good points about being careful in wording choice and looking more into tree size. I'm downloading the American Gut dataset now to see how an OTU table with 10x more OTUs impacts the relative runtimes and memory consumptions.
Some plugins are pure Python and some use other languages, so it is on a per-plugin basis right now. For UniFrac, it would entail recompiling and reinstalling the library itself from source (which should be feasible through setup.py within your Q2 environment, assuming dependencies are in place).