I'm running benchmark tests on several UniFrac implementations as part of a manuscript I'm writing, and wanted to share my results for QIIME 2 to confirm with its developers that I'm seeing what they'd expect. If anything looks amiss please let me know. Thanks!
I'm using qiime2-2018.6 on a Xeon E5-2670 v2 2.50 GHz machine with 256 GB RAM and 20 physical cores running Red Hat natively. All I/O is being done to/from a RAM drive. QIIME 2 is being invoked using the following script, the entirety of which is timed:
where $1 is the number of threads, $2 is "u" or "w", and $3 is the dataset size (number of samples, all pre-selected from the HMP 16S dataset and pre-rarefied to 1000 reads/sample).
If you have any suggestions for modifying the above script in order to decrease the wall clock time required to go from a biom and newick file to a tsv matrix of UniFrac distances, I'd be more than happy to implement them.
Thanks again for all your team's great work on this software!
Those curves differ from what I’d expect with Striped UniFrac (beta-phylogenetic-alt). I’m particularly concerned about the lack of a difference in runtime with multiple threads for weighted UniFrac. Can you confirm that, when run, there actually is increased processor utilization with increasing numbers of threads? If not, there may be a bug.
As a sanity check, you can use the Striped UniFrac algorithm directly using the binary ssu that’s in your environment if you want to execute independent of QIIME 2 for testing. It’s also possible to execute directly via Python (e.g., import unifrac; unifrac.unweighted(...)).
A possibly important point: the library in conda is pre-compiled and generic. The library relies on compile-time auto-vectorization, which can be compiler- and architecture-specific. In other words, the compiled shared library being indirectly used may not be optimized for your architecture.
As a point of context, I can compute unweighted UniFrac on the Earth Microbiome Project on my laptop using four cores in less than a day when running directly against the binary (that was on a mid-2016 MacBook Pro). That dataset is 27k samples and the phylogeny has around 320k tips. We also see near-linear parallel scaling out to 16 cores if hyperthreading is respected, though we have observed some potential confounding with non-uniform memory access (NUMA) boundaries.
@colinbrislawn, EMDUniFrac is related to Striped UniFrac in the postorder reduction performed, but they differ substantially in how samples are compared, and in how they can be parallelized. We are working with Prof. Koslicki on a manuscript at the moment on the new algorithm. Striped UniFrac can be found here, which is the library behind beta-phylogenetic-alt.
Yikes, that was an oversight on my part. Since beta-diversity didn’t support parallelized computation of weighted UniFrac, I’d assumed beta-diversity-alt didn’t either. I just tested it though and it definitely does. I’ll re-run those and post an update early next week with the new numbers.
I was surprised also that unweighted UniFrac didn’t show better scaling with additional cores. I’ll give it a run on AWS using the QIIME 2 AMI to see if it’s just my particular environment it didn’t like.
@colinbrislawn, thanks for pointing me to EMDUniFrac! It wasn’t on my radar before but I’ll be including it now.
To clarify, the principal improvement from EMDUniFrac is reflected in Striped UniFrac.
The parallel model of Striped UniFrac is metric agnostic as it is based on how the distance matrix is computed, not the specific metric. In brief, diagonals of the distance matrix are computed in such a way that the diagonals are independent, readily vectorizable, represent an identical amount of compute, and only compute a few redundant values when the number of samples is even.
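To illustrate the idea, here is my own hypothetical sketch of that pairing scheme (the function name and shape are mine, not the library's code): stripe k pairs each sample i with sample (i + k) mod n, so each stripe is one diagonal of the matrix.

```python
def stripes(n):
    """Yield, for each stripe k, the sample pairs (i, (i + k) % n).

    Each stripe touches every sample exactly once and holds the same
    number of comparisons, so stripes are independent, uniform units
    of work that parallelize and vectorize naturally.
    """
    for k in range(1, n // 2 + 1):
        yield [(i, (i + k) % n) for i in range(n)]

# With an odd number of samples every unordered pair occurs exactly once;
# with an even number, the final stripe visits each of its pairs twice --
# the "few redundant values" mentioned above.
```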
My baseline single threaded testing between Striped UniFrac and QIIME 2’s Fast UniFrac (which uses scikit-bio’s implementation) is below. These values are from when we were designing the algorithm, so are dated and are not necessarily reflective of the present implementation. The tables were random samples from the American Gut, closed reference 97% against Greengenes.
For transparency, I wrote both the scikit-bio and Striped UniFrac implementations. The scikit-bio Fast UniFrac implementation is compiled like Striped UniFrac, and to the best of my knowledge is the fastest Fast UniFrac implementation available, irrespective of package or language.
edit: these values are running against scikit-bio and Striped UniFrac directly, not through QIIME 2.
You were right - compared to beta-phylogenetic-alt, the ssu binary is about 4x faster!
I tried running beta-phylogenetic-alt on EC2 using the latest QIIME 2 AMI, but it seems beta-phylogenetic-alt parallel processing is still broken on EC2, as detailed in GitHub issue #211.
I appended the updated results from my local machine below. Do these numbers seem about right?
Do you have numbers on the performance difference you observe between beta-phylogenetic-alt and the binary directly? Both ultimately rely on the exact same library code. I’d anticipate some overhead with the QIIME 2 plugin / artifact machinery but I don’t have an intuition.
The benchmarks are still lower than I’d expect. I do wonder whether the general build in conda is just not nearly as optimized as a recompilation. The parallel scaling may be bottoming out and encountering constant overhead? We observed effectively linear scaling out to 16 cores on a single system, however we did need to seat the threads w.r.t. hyper threading boundaries using taskset.
I’m not sure what you mean by “numbers on the performance difference you observe between beta-phylogenetic-alt and the binary directly”. I included the measured runtimes in my last post (SSU = calling the binary directly; QIIME 2 = beta-phylogenetic-alt). Are you referring to a different metric?
In the biom file with 4400 samples, there are 1133 OTUs.
You can give the benchmark script a try on your system if you like:
wget https://www.dropbox.com/s/57a25p2536k0t8r/benchmark.zip?dl=1
unzip benchmark.zip\?dl\=1
for i in {1..20}
do
    for w in w u
    do
        /usr/bin/time -f "qiime2 $i $w %e %M" /bin/sh qiime2.sh $i $w
        /usr/bin/time -f "ssu $i $w %e %M" /bin/sh ssu.sh $i $w
    done
done
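If it helps when tabulating results, the records that script emits can be parsed with a few lines of Python (a hypothetical helper for illustration, not part of the benchmark itself):

```python
def parse_timing(line):
    """Parse one record from `/usr/bin/time -f "NAME THREADS METRIC %e %M"`.

    %e is elapsed wall-clock time in seconds and %M is maximum resident
    set size in kilobytes.
    """
    impl, threads, metric, wall_s, maxrss_kb = line.split()
    return {"impl": impl, "threads": int(threads), "metric": metric,
            "wall_s": float(wall_s), "maxrss_kb": int(maxrss_kb)}
```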
Personally, I think your software does pretty well. I don’t doubt you can get it even faster by tweaking the operating system, but for this benchmark I need to use default configurations to have the closest approximation of how a normal user would run it.
Hi @Daniel_Smith, sorry if there was confusion there, I thought your prior results were comparing beta-phylogenetic-alt versus beta-phylogenetic. It does seem like there is some fluff we can trim with the Python interface, thank you for these!
The tree being used is a bit small. This may or may not be relevant for your needs/interests, but the theoretical compute for Striped UniFrac and Fast UniFrac is actually identical at O(n^2 * m), where n is the number of samples and m is the number of vertices in the tree. The real difference is in memory consumption, which is O(n^2 + n * m) for Fast UniFrac and roughly O(n^2 + n * log(m)) for Striped UniFrac. For larger trees, particularly fragment insertion ones, the difference will be appreciable.
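As a rough back-of-envelope illustration of why the memory models matter (the constants and 8-byte item size here are my own illustrative assumptions, not figures from the library):

```python
import math

def fast_unifrac_bytes(n, m, itemsize=8):
    # O(n^2 + n*m): the distance matrix plus per-sample state
    # spanning all tree vertices
    return (n * n + n * m) * itemsize

def striped_unifrac_bytes(n, m, itemsize=8):
    # roughly O(n^2 + n*log(m)): the distance matrix plus per-sample
    # state bounded by a logarithmic term in the tree size
    return (n * n + n * math.ceil(math.log2(m))) * itemsize
```

At EMP-like scale (n = 27,000 samples, m ≈ 320,000 vertices), the n*m term alone is on the order of 69 GB under these assumptions, while the n*log2(m) term is only a few megabytes; the shared n^2 distance-matrix term dominates Striped UniFrac instead.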
For clarification, I did not mean to suggest changes to your operating system, configurations, or anything which requires administrative level permissions. Generic binaries from bioconda have known performance issues for other parts of the QIIME 2 ecosystem (e.g., q2-dada). Unfortunately, we don’t have a good general workaround yet other than suggesting recompilation, if it is a concern. And it is plausible that any compute- or memory-bound threaded application, not just Striped UniFrac, would benefit from selection and seating of specific cores. This can be done without recompilation with the userland tool taskset. For example, taskset -c 0-31:2 qiime diversity beta-phylogenetic-alt ... would fix the process to cores 0, 2, 4, etc.
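A small hypothetical helper to build that taskset prefix for a given thread count (the helper is mine; it assumes hyperthread siblings are the odd-numbered logical cores, which varies by system, so check lscpu for your topology):

```python
def taskset_prefix(n_threads):
    """Build a `taskset -c` prefix pinning to cores 0, 2, 4, ...

    Stepping by two keeps one thread per physical core on systems
    where the odd-numbered logical cores are hyperthread siblings.
    Uses taskset's start-end:stride CPU list syntax.
    """
    last = 2 * (n_threads - 1)
    return f"taskset -c 0-{last}:2"

# taskset_prefix(16) -> "taskset -c 0-30:2", i.e. cores 0, 2, ..., 30
```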
You can get deep into the weeds on this stuff quickly though. Since you mentioned this was for a manuscript, it may be defensive to note that the stats are based on “out-of-the-box” use. Describing this use as “normal” may be unsubstantiated: we don’t track what users do in their environments so we don’t really know, and it’s not uncommon for system administrators to invest effort to optimize libraries for their hardware, particularly on shared resources.
On an aside, GNU Time 1.7 had a bug in which maximum resident memory was reported at 4x the real value. This was the default binary for CentOS and RHEL 6, so it was common to encounter in the field. May be worth a quick /usr/bin/time --version check.
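A hedged sketch of the correction, if rerunning isn't practical (the helper is mine; the 4x factor is specific to the 1.7 release on systems with 4 KB pages):

```python
def corrected_maxrss_kb(reported_kb, gnu_time_version):
    """Normalize GNU time's %M (max RSS, kilobytes) output.

    GNU time 1.7 converted ru_maxrss from pages to KB even though
    Linux already reports it in KB, inflating the value by the page
    size in KB (4 on typical x86 systems). Later releases are correct.
    """
    if gnu_time_version == "1.7":
        return reported_kb // 4
    return reported_kb
```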
My RHEL6 system is indeed affected by the GNU Time 1.7 bug you pointed out. I'm really glad you caught that! Graphs with the corrected memory usage are below.
I gave taskset a go and didn't observe any notable difference in runtime. That's an interesting trick though, and one I hadn't heard of before.
I'd really like to benchmark beta-phylogenetic-alt and ssu in the best light possible, and replicate the linear scaling you see. Are there instructions out there on how to recompile QIIME 2 binaries?
Good points about being careful in wording choice and looking more into tree size. I'm downloading the American Gut dataset now to see how an OTU table with 10x more OTUs impacts the relative runtimes and memory consumptions.
Some plugins are pure Python and some use other languages, so it is on a per-plugin basis right now. For UniFrac, it would entail recompiling and reinstalling the library itself from source (which should be feasible through setup.py within your Q2 environment, assuming dependencies are in place).