Hi @tashoo,
I think that is going to be the issue. I initially pored through the Docker log and annotated a few interesting pieces, which I'll leave below for anyone who might run into something similar.
Looking into it a bit more, it looks like Docker already knows about Rosetta, but to tell it that a given container needs the emulation layer you have to pass the `--platform linux/amd64` flag. If you are using Docker Desktop there may be an advanced setting somewhere for this, but I'm not confident. You may need to run the image from the command line to test whether this approach is viable.
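For example (just a sketch; substitute the Galaxy image and the options you actually run with):

```sh
# Pull/run the image as x86_64 so Docker uses the emulation layer.
# <galaxy-image> and <your-usual-options> are placeholders.
docker pull --platform linux/amd64 <galaxy-image>
docker run --platform linux/amd64 <your-usual-options> <galaxy-image>
```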
Hopefully that gives you a direction to start poking, but the main issue here is that the processor is simply a different architecture than the image expects, so it's kind of surprising it worked as well as it did. Perhaps Rosetta is already running under the hood but there's still some incompatibility lurking? I can't really say for sure.
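If you want to confirm what is actually running before digging deeper, a couple of quick checks (the angle-bracket names are placeholders):

```sh
# What architecture was the image built for?
docker image inspect --format '{{.Os}}/{{.Architecture}}' <galaxy-image>

# What does the running container report? x86_64 means the emulation
# layer is in play on Apple silicon; aarch64 means it is running natively.
docker exec <container-name> uname -m
```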
ORIGINAL REPLY:
Thank you for the Docker log; that is really helpful. I'm going to pull out a few pieces that seem interesting, though I don't yet know how significant they are to your problem.
```
==> /home/galaxy/logs/slurmd.log <==
[2022-08-06T00:41:30.214] slurmd version 17.11.2 started
[2022-08-06T00:41:30.217] slurmd started on Sat, 06 Aug 2022 00:41:30 +0000
[2022-08-06T00:41:30.221] CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=1 Memory=7851 TmpDisk=59819 Uptime=1007 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-08-08T23:24:13.098] error: Domain socket directory /var/spool/slurmd: No such file or directory
[2022-08-08T23:24:13.116] Node reconfigured socket/core boundaries SocketsPerBoard=4:1(hw) CoresPerSocket=1:4(hw)
[2022-08-08T23:24:13.117] Message aggregation disabled
[2022-08-08T23:24:13.122] CPU frequency setting not configured for this node
[2022-08-08T23:24:13.129] slurmd version 17.11.2 started
[2022-08-08T23:24:13.134] slurmd started on Mon, 08 Aug 2022 23:24:13 +0000
[2022-08-08T23:24:13.138] CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=1 Memory=7851 TmpDisk=59819 Uptime=1124 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
```
Some basic info from SLURM (the job scheduler). This all seems perfectly fine. The TmpDisk
might be low, but it would depend on what the units are on that number and I'm not familiar enough with the configuration to suggest a change there anyway.
After that initial setup, we see a string of these:
```
==> /home/galaxy/logs/uwsgi.log <==
Mon Aug 8 23:24:13 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:14 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:15 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:16 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:17 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:18 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:19 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:20 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:21 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:22 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:23 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
Mon Aug 8 23:24:24 2022 - *** uWSGI listen queue of socket "127.0.0.1:4001" (fd: 6) full !!! (28770288/64) ***
```
That definitely seems wrong and might be the source of the issue, although I don't have a solution yet. I'm also not quite certain why the listening end of the uWSGI socket would fill up, but it could explain why the job failed if uWSGI is dropping information that the worker(s) needed.
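For what it's worth, if I'm reading the message right, the `/64` is the socket's listen backlog, and uWSGI's `listen` option controls it, so raising it is one thing to try if the queue really is the bottleneck (I'm not certain it is). A rough sketch; where exactly the uWSGI config lives inside the Galaxy image may differ:

```sh
# uWSGI refuses a backlog larger than the kernel's limit, so check
# that first from inside the container:
cat /proc/sys/net/core/somaxconn

# Then raise uWSGI's backlog, e.g. "listen = 1024" in the [uwsgi]
# section of the config the image uses, or --listen 1024 on the
# command line that starts uWSGI.
```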
```
==> /home/galaxy/logs/handler1.log <==
galaxy.jobs DEBUG 2022-08-08 23:24:42,690 [pN:handler1,p:255,tN:SlurmRunner.work_thread-0] Job wrapper for Job [101] prepared (439.562 ms)
galaxy.jobs.command_factory INFO 2022-08-08 23:24:42,751 [pN:handler1,p:255,tN:SlurmRunner.work_thread-0] Built script [/export/galaxy-central/database/job_working_directory/000/101/tool_script.sh] for tool command [python '/galaxy-central/tools/data_source/upload.py' '/galaxy-central' '/export/galaxy-central/database/job_working_directory/000/101/registry.xml' '/export/galaxy-central/database/job_working_directory/000/101/upload_params.json' '101:/export/galaxy-central/database/job_working_directory/000/101/working/dataset_101_files:/export/galaxy-central/database/files/000/dataset_101.dat']
galaxy.tool_util.deps DEBUG 2022-08-08 23:24:42,876 [pN:handler1,p:255,tN:SlurmRunner.work_thread-0] Using dependency bcftools version 1.5 of type conda
galaxy.jobs.runners DEBUG 2022-08-08 23:24:42,884 [pN:handler1,p:255,tN:SlurmRunner.work_thread-0] (101) command is: mkdir -p working outputs configs
if [ -d _working ]; then
rm -rf working/ outputs/ configs/; cp -R _working working; cp -R _outputs outputs; cp -R _configs configs
else
cp -R working _working; cp -R outputs _outputs; cp -R configs _configs
fi
cd working; /bin/bash /export/galaxy-central/database/job_working_directory/000/101/tool_script.sh > ../outputs/tool_stdout 2> ../outputs/tool_stderr; return_code=$?; cd '/export/galaxy-central/database/job_working_directory/000/101';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
export PATH=$PATH:'/export/tool_deps/_conda/envs/__bcftools@1.5/bin' ; python "metadata/set.py"; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2022-08-08 23:24:42,966 [pN:handler1,p:255,tN:SlurmRunner.work_thread-0] (101) submitting file /export/galaxy-central/database/job_working_directory/000/101/galaxy_101.sh
galaxy.jobs.runners.drmaa DEBUG 2022-08-08 23:24:42,966 [pN:handler1,p:255,tN:SlurmRunner.work_thread-0] (101) native specification is: --ntasks=1 --share
galaxy.jobs.runners.drmaa INFO 2022-08-08 23:24:43,017 [pN:handler1,p:255,tN:SlurmRunner.work_thread-0] (101) queued as 2
galaxy.jobs.runners.drmaa DEBUG 2022-08-08 23:24:43,190 [pN:handler1,p:255,tN:SlurmRunner.monitor_thread] (101/2) state change: job is queued and active
```
Here we see the job has been set up and submitted. There are a lot of scripts involved, but in particular we see `python '/galaxy-central/tools/data_source/upload.py' ...`, which is the upload tool.
```
==> /home/galaxy/logs/handler1.log <==
galaxy.jobs.runners.drmaa DEBUG 2022-08-08 23:24:44,226 [pN:handler1,p:255,tN:SlurmRunner.monitor_thread] (101/2) state change: job finished, but failed
galaxy.jobs.runners.slurm WARNING 2022-08-08 23:24:44,366 [pN:handler1,p:255,tN:SlurmRunner.monitor_thread] (101/2) Job failed due to unknown reasons, job state in SLURM was: FAILED
```
This is just SLURM (via the DRMAA library) noticing that the job failed. It isn't a real explanation, and the failure probably wasn't SLURM's fault, since SLURM is only relaying the message.
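If you can get a shell inside the container, SLURM itself can sometimes say a bit more than what DRMAA relays back. These are standard SLURM commands; the job id `2` comes from the log above, and `sacct` only works if accounting is enabled in the container's SLURM setup:

```sh
# Accounting record for the failed job: state, exit code, node
sacct -j 2 --format=JobID,JobName,State,ExitCode,NodeList

# The slurmd log on the execution node often holds the actual error
tail -n 50 /home/galaxy/logs/slurmd.log
```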
After that we see some cleanup fail, presumably because these files were never created as expected:
```
==> /home/galaxy/logs/handler1.log <==
galaxy.tools.error_reports DEBUG 2022-08-08 23:24:44,760 [pN:handler1,p:255,tN:SlurmRunner.work_thread-1] Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x406b57b550> generated response None
galaxy.jobs.runners DEBUG 2022-08-08 23:24:44,805 [pN:handler1,p:255,tN:SlurmRunner.work_thread-1] (101/2) Unable to cleanup /export/galaxy-central/database/job_working_directory/000/101/galaxy_101.sh: [Errno 2] No such file or directory: '/export/galaxy-central/database/job_working_directory/000/101/galaxy_101.sh'
galaxy.jobs.runners DEBUG 2022-08-08 23:24:44,819 [pN:handler1,p:255,tN:SlurmRunner.work_thread-1] (101/2) Unable to cleanup /export/galaxy-central/database/job_working_directory/000/101/galaxy_101.o: [Errno 2] No such file or directory: '/export/galaxy-central/database/job_working_directory/000/101/galaxy_101.o'
galaxy.jobs.runners DEBUG 2022-08-08 23:24:44,836 [pN:handler1,p:255,tN:SlurmRunner.work_thread-1] (101/2) Unable to cleanup /export/galaxy-central/database/job_working_directory/000/101/galaxy_101.e: [Errno 2] No such file or directory: '/export/galaxy-central/database/job_working_directory/000/101/galaxy_101.e'
galaxy.jobs.runners DEBUG 2022-08-08 23:24:44,852 [pN:handler1,p:255,tN:SlurmRunner.work_thread-1] (101/2) Unable to cleanup /export/galaxy-central/database/job_working_directory/000/101/galaxy_101.ec: [Errno 2] No such file or directory: '/export/galaxy-central/database/job_working_directory/000/101/galaxy_101.ec'
```
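One quick way to check that theory is to list the job's working directory (the path comes straight from the log above) and see what actually got written; `<container-name>` is a placeholder:

```sh
# If galaxy_101.sh is missing entirely, the job script was never
# written, i.e. the failure happened before SLURM even ran the tool.
docker exec <container-name> ls -la \
  /export/galaxy-central/database/job_working_directory/000/101/
```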