bioenv long run time?

ange · June 19, 2019, 3:35pm

Hi,

I am trying to run bioenv on 14 samples with 60-65 columns of metadata (already filtered down). My job seems to be taking too long; it has been running for more than 24 hours with 12 cores-- I was wondering if this is normal for bioenv runs?

Thank you!

thermokarst · June 20, 2019, 4:59pm

Hey there @ange, I found this note in the documentation:

Warning: This method can take a long time to run if a large number of variables are specified, as all possible subsets are evaluated at each subset size.

You don't have many samples, but you do have quite a few metadata columns. Maybe try running it on fewer metadata columns?

I don't think this action is able to use multiple cores, so the other 11 are being wasted right now.
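
To put a rough number on that warning: if every non-empty subset of the metadata columns is scored, n columns give 2^n - 1 subsets. The sketch below only counts them. The rate of 1,000 subsets per second is a made-up figure for illustration, not a measured bioenv speed.

```python
from math import comb

def count_subsets(n_columns: int) -> int:
    """Number of non-empty column subsets, summed over every subset size."""
    return sum(comb(n_columns, k) for k in range(1, n_columns + 1))

# Hypothetical throughput, chosen only to make the scale concrete.
ASSUMED_SUBSETS_PER_SECOND = 1_000

for n in (10, 20, 60):
    total = count_subsets(n)
    years = total / ASSUMED_SUBSETS_PER_SECOND / (3600 * 24 * 365)
    print(f"{n} columns: {total:,} subsets, ~{years:.3g} years at the assumed rate")
```

Each extra column doubles the work, which is why trimming the metadata helps far more than adding hardware.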

ange · June 20, 2019, 5:13pm

Ah thank you! I did see that note at some point but didn't know that 'can take a long time' meant 3 days and counting. Don't tell our cluster manager about me wasting cores D-;

Fabian · September 26, 2019, 11:34am

I am also seeing bioenv put more than 10 cores under full load (about 50% of the cores, actually).

If these cores are not actually used, how can I keep bioenv from loading them?
The "long time" is something I can handle myself through the number of metadata columns.
But loading 11+ cores to 100% without using them comes close to a reason to stop using bioenv in the qiime2 framework.

Cheers!

thermokarst · September 26, 2019, 1:28pm

@Fabian,

I don't think this usage can be attributed to bioenv --- it is not physically possible for this action to use more than one core at a time. I would double-check that you aren't running something else in the background that is utilizing those cores.

Fabian · September 26, 2019, 3:26pm

First of all thanks for the super quick reply and for the great community support!
I have been using Qiime2 for more than a year now, passively lurking on the forums, and this is the first time I actually needed to create a post.

I did check what you are suggesting before posting, and I could reproduce the load:

Shell instance 1 --> activate conda env --> Run Command --> 50% of CPUs under load (12 cores (24 in hyper threading) are at 90%+)

Shell instance 2 --> activate conda env --> Run Command --> 100% of CPUs under load (24 cores (48 in hyper threading) are at 100%)

htop is currently listing all cores at 100% and in the list of jobs that take CPU usage:
PID ... CPU .... command
48906 ... ~2350% ... qiime
44028 ... ~2350% ... qiime

So despite physical impossibility, qiime takes 4600%+ CPU usage total when running 2 instances of bioenv with a big (like REALLY big for bioenv: 180+ samples and 50+ meta data entries) dataset.

If there are some special logfiles that could help enlighten the situation, I can happily send/check them of course.

Cheers!
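
One possible explanation, offered as an assumption rather than something confirmed in this thread: a NumPy/SciPy-based tool can load many cores if its linear-algebra backend (OpenBLAS, MKL, or an OpenMP runtime) starts its own thread pool, even when the calling code is single-threaded. If that is what is happening, capping those pools through their environment variables before launching the job should hold it to one core. A minimal sketch, with `single_thread_env` being a hypothetical helper name:

```python
import os

# Environment variables read by common numerical backends at start-up.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")

def single_thread_env(base=None):
    """Return a copy of the environment with backend thread pools capped at 1."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = "1"
    return env

env = single_thread_env()
print({name: env[name] for name in THREAD_VARS})

# Usage (not executed here): hand the capped environment to the job,
# for example via subprocess.run(your_qiime_command, env=env).
```

If the load stays at many cores even with these caps, the cause is something else and this sketch does not apply.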

ange · September 30, 2019, 3:54pm

Hi,

Just to contribute to this discussion--
I've tried using vegan::bioenv in R on exported qiime outputs-- I don't know about efficiency but I know the one implemented in R allows parallelisation.

Also, I don't know if you've ever tried bvStep, but it has been a proposed alternative to bioenv-- it does a stepwise search instead of an upfront full search, which supposedly makes it more efficient than bioenv esp. for large datasets.

http://menugget.blogspot.com/2011/06/clarke-and-ainsworths-bioenv-and-bvstep.html
https://rdrr.io/github/marchtaylor/sinkr/man/bvStep.html
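
To illustrate why a stepwise search is cheaper, here is a toy greedy forward selection. It is not the bvStep algorithm from the links above (which is more elaborate), and the scoring function is a made-up stand-in for the rank correlation that bioenv computes. The point is only that it scores at most n(n+1)/2 subsets instead of 2^n - 1.

```python
def forward_select(columns, score):
    """Greedily add the column that most improves score(subset); stop when none does."""
    chosen, best, evaluations = [], float("-inf"), 0
    remaining = list(columns)
    while remaining:
        scored = []
        for col in remaining:
            scored.append((score(chosen + [col]), col))
            evaluations += 1
        top_score, top_col = max(scored)
        if top_score <= best:
            break  # no candidate improves on the current subset
        chosen.append(top_col)
        remaining.remove(top_col)
        best = top_score
    return chosen, best, evaluations

# Made-up score: rewards two "informative" columns and penalises subset size.
def toy_score(subset):
    return len({"pH", "depth"} & set(subset)) - 0.1 * len(subset)

cols = ["pH", "depth", "temp", "salinity", "nitrate"]
chosen, best, n_eval = forward_select(cols, toy_score)
print(chosen, n_eval, "evaluations vs", 2 ** len(cols) - 1, "for a full search")
```

The trade-off is that a greedy search can miss the best subset, which a full search cannot.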

thermokarst · September 30, 2019, 4:00pm

Thanks @Fabian, I'm not sure what to tell you. This method is single-threaded, so perhaps your computer is under load from some other process? Anyway, I just wanted to point out that behind the scenes q2-diversity is using scikit-bio for this statistic. Check out the manpage here:

http://scikit-bio.org/docs/latest/generated/skbio.stats.distance.bioenv.html#skbio.stats.distance.bioenv

Of particular relevance to what you're working on:

Fabian · September 30, 2019, 6:31pm

Thanks for the ongoing discussion!
I am fine with a longer run time (as long as it is possible to limit CPU/RAM, run times in the range of weeks are acceptable).
I came here from a search for the number of cores used, because I observed the behaviour mentioned above and was curious.
Nevertheless, I am looking forward to testing and reading the manuals of the different packages you and ange mentioned, and I hope that I can make one of them (or the launch from qiime) work properly on big datasets.

Back to the original problem:
It was very likely qiime2.
When checking CPU usage with top and htop on my system:
- top had multiple qiime entries among the highest CPU usage, while I had only started one or two qiime bioenv jobs.
- htop listed qiime with 4-digit CPU usage values.
- all other processes were negligible in % CPU.

When I am at work tomorrow I will try to reproduce and screenshot this with my dataset against my local qiime2 version and against a fresh conda env & newest qiime2.

thermokarst · September 30, 2019, 7:14pm

It's possible that what you are seeing is related to click, the CLI tool running behind the scenes in q2cli. If possible, try running the bioenv visualizer using the Artifact API; this will give you a more "pure" test.
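
For reference, a minimal sketch of that Artifact API test. It assumes a QIIME 2 environment with q2-diversity installed; `run_bioenv` is a hypothetical helper name, and the file names in the usage comment are placeholders for your own distance matrix, metadata, and output.

```python
def run_bioenv(distance_matrix_path, metadata_path, output_path):
    """Run the q2-diversity bioenv visualizer without going through q2cli."""
    # Imported lazily so this file can be loaded outside a QIIME 2 environment.
    from qiime2 import Artifact, Metadata
    from qiime2.plugins import diversity

    distance_matrix = Artifact.load(distance_matrix_path)
    metadata = Metadata.load(metadata_path)
    results = diversity.visualizers.bioenv(
        distance_matrix=distance_matrix, metadata=metadata
    )
    results.visualization.save(output_path)
    return output_path

# Usage (placeholder file names, run inside the activated conda env):
#   run_bioenv("distance-matrix.qza", "metadata.tsv", "bioenv.qzv")
```

Watching htop while this runs shows whether the extra load appears without the CLI layer involved.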

Fabian · October 1, 2019, 3:40pm

I will do that within the next few days and report back once I have all the results. Thanks for the suggestion.

Finally a reason to use python again after that nice qiime cli convenience.

Fabian · October 2, 2019, 6:41am

Ok....
I just tested a bit with the CLI and with the python Artifact API (see below, redacted pathnames):
a) qiime2 2018.08
CLI:
50% of cores are started
Artifact API:
50% of cores are started
b) qiime2 2019-07 release
CLI:
1 core as desired (skipped the Artifact API at this point)

So in the end it was an issue with the old version, which I had kept because I wanted to stay consistent with an older analysis started with version 2018-08.
Sorry for the fuss though, and thanks a lot for all the input!

ange · October 2, 2019, 1:27pm

Hey, so in the end are you still planning to run your job in qiime2 and wait it out? If so can you let me know how long it takes you? At one point I just killed my job and moved on to other platforms because it was taking too much time.

Fabian · October 2, 2019, 2:07pm

Yes, as long as I can spare that 1 core I will keep it running for test purposes.
That many cores with unclear usage were not worth it for me.
Since the typical dataset here is about this size, I could imagine running bioenv calculations in parallel and waiting that long.

Fabian · October 16, 2019, 6:32am

Currently at 334 hours and pending.

colinbrislawn · October 17, 2019, 4:33pm

Is that CPU hours or wall-clock hours (~14 days?)

~~Something is funky here!!~~
EDIT: I've been reminded that this method can take a very long time, especially when you have lots of metadata variables like you do. You might have to wait a little longer...