Is there something unusual about how the TMPDIR is being set in 2023.2?

Hello,

I'm running into an issue related to how the TMPDIR is being set in QIIME 2 2023.2.

Here is the code I am running:

source activate qiime2-2023.2

set -x
set -e

export TMPDIR=/panfs/jpshaffer/tmp2

qiime diversity beta-phylogenetic \
  --i-table /projects/emp/emp-allergy/data/emp_16s_release2_qiita_with_release1_90bp_gg2_asv_rar5k.qza \
  --p-metric 'weighted_unifrac' \
  --p-threads 60 \
  --i-phylogeny /databases/gg/2022.10/2022.10.phylogeny.asv.nwk.qza \
  --o-distance-matrix /projects/emp/emp-allergy/data/emp_16s_release2_qiita_with_release1_90bp_gg2_asv_rar5k_dist_wunifrac.qza \
  --verbose

Here is the error message:

+ set -e
+ export TMPDIR=/panfs/jpshaffer/tmp2
+ TMPDIR=/panfs/jpshaffer/tmp2
+ qiime diversity beta-phylogenetic --i-table /projects/emp/emp-allergy/data/emp_16s_release2_qiita_with_release1_90bp_gg2_asv_rar5k.qza --p-metric weighted_unifrac --p-threads 60 --i-phylogeny /databases/gg/2022.10/2022.10.phylogeny.asv.nwk.qza --o-distance-matrix /projects/emp/emp-allergy/data/emp_16s_release2_qiita_with_release1_90bp_gg2_asv_rar5k_dist_wunifrac.qza --verbose
Traceback (most recent call last):
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-111>", line 2, in beta_phylogenetic
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 475, in _callable_executor_
    outputs = self._callable(scope.ctx, **view_args)
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_diversity/_beta/_pipeline.py", line 31, in beta_phylogenetic
    dm, = action(table, phylogeny, threads=threads,
  File "<decorator-gen-512>", line 2, in weighted_unifrac
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 405, in _callable_executor_
    prov = provenance.fork(name)
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/core/archive/provenance.py", line 442, in fork
    forked = super().fork()
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/core/archive/provenance.py", line 342, in fork
    forked._build_paths()
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/core/archive/provenance.py", line 142, in _build_paths
    self.path = qiime2.core.path.ProvenancePath()
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/core/path.py", line 146, in __new__
    path = tempfile.mkdtemp(prefix=prefix)
  File "/home/jpshaffer/software/miniconda3/envs/qiime2-2023.2/lib/python3.8/tempfile.py", line 358, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 28] No space left on device: '/tmp/qiime2-provenance-pme9mvn2'

Plugin error from diversity:

  [Errno 28] No space left on device: '/tmp/qiime2-provenance-pme9mvn2'

See above for debug info.
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command:

ssu -i /tmp/qiime2/jpshaffer/data/c4d2bb02-951c-4737-9f99-c05c2038d05e/data/feature-table.biom -t /tmp/qiime2/jpshaffer/data/1d6fd745-9191-448c-9066-6b754e53a272/data/tree.nwk -m weighted_unnormalized -o /tmp/q2-LSMatFormat-uxju8cte

It appears QIIME 2 is not using the TMPDIR that was defined in the job script.

Thanks in advance for any insight.

Hi @Lichen!

Good observation, what you did should have worked (and certainly has worked in the past).

Just to check how other programs treat it, could you run:

mktemp -u

in the same context as your command (post env-var setting)?

What's weird about this is that we are just using Python's standard library (tempfile.mkdtemp) to resolve the temp dir, so I really don't have a good suggestion yet.
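
If it helps, here is a minimal sketch of that resolution, assuming a stock Python 3.8 environment (the /panfs path is just the example from your job script):

import os
import tempfile

# tempfile.gettempdir() consults TMPDIR, then TEMP, then TMP, and only
# falls back to /tmp if none of those are set, so an exported TMPDIR
# should normally be picked up by mkdtemp().
os.environ["TMPDIR"] = "/panfs/jpshaffer/tmp2"  # example path from the job script above
tempfile.tempdir = None  # clear the cached value so gettempdir() re-resolves

print(tempfile.gettempdir())                          # expected: /panfs/jpshaffer/tmp2
print(tempfile.mkdtemp(prefix="qiime2-provenance-"))  # created under that directory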

Thank you!

Here is the output from running 'mktemp -u':

+ mktemp -u
/panfs/jpshaffer/tmp2/tmp.yxYnLJPmXa

Thanks again!

Well darn, that looks perfect.

Are you running these commands through a queueing system, and if so, was the mktemp also run through the same queueing system?

Yes, the error is from running on a queueing system. However, the 'mktemp -u' test I just performed was in an interactive job on the same system. I have just kicked off a new job through the same queueing system as the initial job that failed, including the 'mktemp -u' test, and will follow up once that fails or completes.

Thanks again!

Hey @Lichen,

Just wanted to check if you found the source of the issue?

Thanks!
-Evan

Thank you! It looks like our /tmp/ was cleaned up, which freed enough space for the job to complete. Bummer! I was hoping to reproduce the problem. I will be sure to re-post if it comes up again, but let me know if you'd like to investigate further.

Hi Justin and Evan,
we also had trouble executing our standard SLURM qiime2-2023.2 scripts on our HPC system. Our solution was to empty the temporary directory prior to each run by including something like this in the bash script (before activating the conda qiime2-2023.2 environment, running QIIME 2, and deactivating it):

#!/bin/bash

#SBATCH --job-name=qiime2
#SBATCH --cpus-per-task=2
#SBATCH --output=log/qiime2-%j.out
#SBATCH --error=log/qiime2-%j.err

hostname
# clear any leftover QIIME 2 temp cache before the run
rm -r "$TEMP/qiime2"

Maybe it is of use to you or other QIIME 2 users...

Hi @Mechah,

That's a good workaround, but it's definitely not our goal to need such a thing.
Would you be able to share any details about what specifically wasn't working?

Perhaps there's something we could be doing better here.

Hi and sorry for this late reply...
I had a quick chat with our HPC admins; maybe these comments are of use?

"...The default values for the system are (I believe) $TMPDIR, $TEMPDIR (which point to the system's /tmp volume), and $TMP and $TEMP (pointing to /media/temp/your_user_id). Therefore, if two users of Qiime were using it at the same time and I believe qiime2-2023.2 was utilizing the first two, there was a permissions collision..."

Hi @Mechah,

That is pretty helpful. It sounds like your system makes a strong distinction between those environment variables and the locations they point to, which is a little strange, as usually there's no meaningful difference between TMP and TMPDIR (at least in my experience).
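
If it's useful, a quick way to see what a given job actually inherits is to print the variables from inside the job (plain Python, nothing QIIME 2-specific):

import os
import tempfile

# Show the temp-related variables as the job sees them, and where Python
# will resolve its temporary directory from them.
for var in ("TMPDIR", "TEMPDIR", "TEMP", "TMP"):
    print(var, "=", os.environ.get(var))
print("tempfile.gettempdir() ->", tempfile.gettempdir())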

It's worth noting that we prefix this temporary cache with the username (or the UID, failing that), so there shouldn't be any collisions. The one exception is a really rare and unavoidable race condition when two accounts create the cache for the very first time on the same host, since both accounts would be attempting to set the sticky bit on the directory that allows other users to write to their own specialized subdirectories (which avoids this issue from that point on).
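
To illustrate the pattern (a rough sketch of the idea only, not QIIME 2's actual code):

import getpass
import os
import stat
import tempfile

# Rough sketch of the shared-cache pattern described above, not QIIME 2's
# actual implementation.
base = os.path.join(tempfile.gettempdir(), "qiime2")
try:
    os.mkdir(base)
    # Sticky bit + world-writable parent: any user can create their own
    # subdirectory, but only the owner of an entry can delete or rename it.
    os.chmod(base, 0o777 | stat.S_ISVTX)
except FileExistsError:
    pass  # another account already created the shared parent

# Per-user subdirectory (the real code falls back to the UID if needed).
user_dir = os.path.join(base, getpass.getuser())
os.makedirs(user_dir, exist_ok=True)
os.chmod(user_dir, 0o700)  # private to this user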

Ironically, deleting $TMPDIR/qiime2 each time makes that more likely to occur, since the above dance has to happen on every new instantiation of the directory. Also, by deleting that directory, it is possible to delete active work-in-progress from another user who has been scheduled on the same node. Typically this will corrupt the artifact so completely that nothing other than a long traceback will occur, but it's conceivable it could happen at just the right time to produce an artifact with missing data.
