Strategies for importing data and creating provenance without copying data to `/tmp`

crusher083 · December 20, 2022, 10:00am

Datasets keep getting bigger, and data management is becoming a more pronounced problem. During an import operation, QIIME2 copies the raw data into an archive in /tmp, which quickly runs out of space with big datasets. It also leaves a redundant copy of the raw data on the hard drive.

I ran into this problem while developing q2-mOTUs, and the same issue was raised in q2-fondue.
It would be useful if, during `qiime tools import`, QIIME2 operated on the manifest file itself and recorded only the dataset's metadata for provenance, somewhat like Snakemake does.
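
To make the idea concrete, here is a minimal sketch of recording provenance by reference instead of by copy. It is only an illustration under assumptions: `record_provenance` and `file_sha256` are hypothetical helpers, not part of QIIME2, and the record fields (`path`, `size_bytes`, `sha256`) are made up for this example. Each file listed in a manifest is hashed in place, so nothing is written to /tmp.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_provenance(paths):
    """Return one metadata record per input file, leaving the data in place.

    Hypothetical helper for illustration only; not a QIIME2 API.
    """
    records = []
    for raw in paths:
        path = Path(raw).resolve()
        records.append({
            "path": str(path),                 # where the raw data lives
            "size_bytes": path.stat().st_size,  # cheap sanity check
            "sha256": file_sha256(path),        # detects later modification
        })
    return records
```

With records like these, a later step could re-hash the files and refuse to run if a checksum no longer matches, which is roughly how by-reference provenance stays trustworthy without a second copy of the data.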

EDIT: I have worked around this by changing TMPDIR before, but I find it inconvenient, and I think this should be improved to make Q2 more future-proof. Duplicating the raw data brings no benefit, yet it costs time, space, and compute.
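
For reference, the TMPDIR workaround can be sketched as a small wrapper that runs a command with temporary files redirected to a larger disk. This is a sketch under assumptions: `run_with_tmpdir` is a hypothetical helper name, and it relies only on the general convention that programs using the standard temporary-file facilities honour the `TMPDIR` environment variable.

```python
import os
import subprocess
from pathlib import Path


def run_with_tmpdir(command, tmpdir):
    """Run `command` with TMPDIR pointing at `tmpdir`.

    Hypothetical helper for illustration; the parent process
    environment is left untouched.
    """
    if not Path(tmpdir).is_dir():
        raise FileNotFoundError(f"temporary directory does not exist: {tmpdir}")
    # Copy the current environment and override only TMPDIR for the child.
    env = dict(os.environ, TMPDIR=str(tmpdir))
    return subprocess.run(
        command, env=env, check=True, capture_output=True, text=True
    )
```

The same wrapper could be used around a `qiime tools import` call by passing that command as the argument list. It only moves the temporary copy to a disk with more space; it does not avoid the duplication itself.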

@misialq do you have any ideas on this?

Cheers,
Valentyn

Nicholas_Bokulich · December 23, 2022, 6:04am

Hi @crusher083,

I have good news... see the release notes here: