qiime tools import on cluster

Dear Matt,

Before this thread gets automatically closed, could you suggest which debugging scripts to run, as you mentioned earlier?

Best regards,
Nora

Thanks for the bump, @nora. I won’t let this thread close, no worries. I am teaching a QIIME 2 workshop this week (and was out of the office the two weeks before that), so I won’t have time to write the script for you until next week. More soon.

Hey there @nora, just wanted to let you know that this hasn't slipped off my radar. I have good news: I was finally able to reproduce this bug. It appears to be tied to certain networked file systems; we have only seen reports of this on BeeGFS filesystems (which is what your scratch mount is). Anyway, now that I've been able to reproduce this, we have some potential workarounds we are playing with. My current plan is to include a fix for this in the upcoming 2021.4 release. In the meantime, the only solution I can offer you is to avoid that BeeGFS partition, if possible. If you want to see our development discussion, here it is:

Thanks for being patient and lending a hand on this!

Dear Matt,

Thanks a lot for the update. I will follow the thread.

Best,
Nora

I’m having this same problem, although with cutadapt demux (our cluster also uses the BeeGFS scratch file system). Has this been fixed? Or is there a workaround?

Hi @alison - regarding timing, please see my note above:

In the meantime, if your sysadmin has another non-BeeGFS filesystem you could work on, that would be the quickest and easiest workaround.

I don’t think we have any other available file system, but I’ll ask. Would it potentially be less of a problem if I broke the original fastq files into smaller pieces and ran them through demultiplexing and trimming separately, to reduce the use of tmp space? (And then cat them all back together before denoising.)

Hi @alison - the size of the data isn’t the issue; it is how the QIIME 2 Framework interacts with the TMPDIR location that is the problem. In general everything is well behaved, except on certain networked filesystems, like BeeGFS.

If you can set your TMPDIR to a new location, that is the quickest workaround I have for you. If that new location has less disk space, then you might need to think about strategies for breaking things up into smaller chunks, but that is a secondary concern, and depends on the specifics of the replacement filesystem you’re working with.
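
For example, a minimal sketch of that workaround in a shell session might look like the following (the tmp path and the import type/format here are only illustrative placeholders, not specific to your data):

# Point TMPDIR at a non-BeeGFS location before running any QIIME 2 commands
mkdir -p /path/to/non-beegfs/qiime2-tmp
export TMPDIR=/path/to/non-beegfs/qiime2-tmp

# Anything run in this shell afterwards writes its temporary files there,
# e.g. an import (the type/format shown are just an example):
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --input-path raw-reads/ \
  --output-path demux.qza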

Thanks! It seemed like most of the questions about this mentioned redirecting tmp, and I hadn’t hit the problem with a tiny data set that could run in the “default” tmp. So my logic was that if I could keep the data small enough to use the default tmp, instead of redirecting it, it might work (and save me waiting for my sysadmin to answer an email).

At least on my cluster (Saga), creating a dedicated job tmp space seems to solve this problem.

Requesting a job-specific tmp space in the job header:
#SBATCH --gres=localscratch:

Directing TMPDIR to that space:
export TMPDIR=$LOCALSCRATCH
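
Putting the two together, a rough sketch of a full job script might look something like this (the account, walltime, memory, scratch size, and the qiime command are all placeholders to adapt to your own setup):

#!/bin/bash
#SBATCH --job-name=qiime2-import
#SBATCH --account=<your-project>       # placeholder project account
#SBATCH --time=02:00:00                # example walltime
#SBATCH --mem=8G                       # example memory request
#SBATCH --gres=localscratch:20G        # example size; request enough for your temp files

# Send QIIME 2's temporary files to the job-local scratch instead of the BeeGFS mount
export TMPDIR=$LOCALSCRATCH

# Example command; anything run after the export uses the local scratch for tmp
qiime tools import \
  --type 'EMPPairedEndSequences' \
  --input-path emp-paired-end-sequences/ \
  --output-path emp-paired-end-sequences.qza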
