Symlink to artifacts

Hi Dev team,

I've run into an issue where I'd love to be able to operate on a symlink to a QIIME 2 Artifact rather than on the artifact itself. My HPC is a tangled nightmare that hates me, and being able to symlink, rather than work on original files, would make the system slightly less hostile.

I assume the issue has to do with how paths are constructed, but I don't understand enough of the architecture to address the problem. I know you're working on a refactor, and wondered if maybe this could/should be part of it.

Best,
Justine

Hi @jwdebelius,

I work with Nextflow, which creates symlinks to the qza files needed at each step. What happens in your case?

Luca

@Oddant1, it seems like the artifact cache could be an alternative to symlinks here - do you agree? Is this content (work in progress!) the best source for docs on that right now?

Hi @llenzi and @gregcaporaso,

Thanks for the suggestions! I think describing my system might help. I'm working out of a separate folder structure. So, I have two parallel directories:

raw_data/
    |_ study1
        |_ asv_table1.qza
        |_ rep_seq1.qza
    |_ study2
        |_ asv_table2.qza
        |_ rep_seq2.qza
    |_ ...

processed/
    |_ asv_tables/
    |_ rep_seqs/
    |_ ...

I'd like to symlink from the "raw_data" directory, where I've done the per-study processing, to the processed directory, so that the storage only lives in the raw data location, but when I (or someone else) looks for the artifact file, it appears at the symlink location and behaves more or less like a local file.

If I do a symlink:

ln -s raw_data/study1/asv_table1.qza \
 processed/asv_tables/study1_asv_table.qza

And then try to merge the data, I get an error saying that the file can't be read.

My current solution has been to just copy the artifact, which sort of works for my 16S data, but we're currently under storage constraints, and even having two copies of the same feature table is considered a problem.

While I like the idea of caching, @gregcaporaso, I think it's parallel to, but maybe not the same as, my issue. I have heavy storage constraints, but have been told CPU hours are essentially free. So, reading/writing to disk is less of an issue for me, and the bigger issue is having the path resolve to the right location. Maybe caching might let me navigate this, but I'd need some hand-holding to understand how.

Thanks,
Justine

@gregcaporaso, @ebolyen and I discussed the possibility of having cache keys symlink off to data that lives somewhere else entirely, as opposed to needing to be in the cache's data directory. IIRC we discussed potentially doing something similar to what is discussed here when implementing that, but that is not currently how the cache functions.

@jwdebelius I don't think the cache does what you want it to do at this time, but there has been discussion on making what you want happen both in and outside the context of a cache. At this point we don't have a timeline for it.

What the cache would allow you to do is basically specify some portion of your HPC's filesystem (generally you would want a portion that is accessible to the entire HPC) for QIIME 2 to store all of the data it would usually just throw in your temp directory. You can also have QIIME 2 put outputs directly into the cache or pull inputs out of it. The cache stores artifacts in unzipped form, though, which uses up more space; that sounds like the opposite of what you want right now.


Thanks for the input @Oddant1 and for the additional detail @jwdebelius.

@jwdebelius, two thoughts. First, I've had issues in the past when symlinking with relative paths, like in your example command. Have you tried symlinking with absolute paths instead?
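To illustrate why the relative link can end up dangling (with made-up paths, just to show the mechanics): a relative symlink target is resolved against the directory that *contains the link*, not the directory you ran `ln -s` from.

```shell
# Hypothetical stand-in paths for the layout described above.
d=$(mktemp -d)
mkdir -p "$d/raw_data/study1" "$d/processed/asv_tables"
touch "$d/raw_data/study1/asv_table1.qza"

# Same shape as the command above, run from the project root:
ln -s raw_data/study1/asv_table1.qza \
  "$d/processed/asv_tables/study1_asv_table.qza"

# The link stores the literal string "raw_data/study1/asv_table1.qza",
# which gets resolved against $d/processed/asv_tables/ -- a path that
# doesn't exist there, so the link dangles and reads fail:
cat "$d/processed/asv_tables/study1_asv_table.qza" 2>/dev/null \
  || echo "dangling symlink"
```

That would produce exactly the "file can't be read" behavior you're seeing, independent of QIIME 2.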

Second, if you're having trouble linking to the artifacts themselves, could you symlink to the containing directory? I do that all the time and haven't had any QIIME 2 related issues with that. For example, adapted from your command above:

ln -s \
 /absolute/path/to/raw_data/study1/ \
 /absolute/path/to/processed/asv_tables/
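For what it's worth, here's a quick sketch (again with placeholder paths, not your real ones) showing that a file under an absolute directory symlink stays readable from anywhere, because the stored target doesn't depend on the current working directory:

```shell
# Placeholder layout standing in for raw_data/ and processed/.
d=$(mktemp -d)
mkdir -p "$d/raw_data/study1" "$d/processed/asv_tables"
echo "table" > "$d/raw_data/study1/asv_table1.qza"

# Absolute symlink to the containing directory:
ln -s "$d/raw_data/study1" "$d/processed/asv_tables/study1"

# Reading through the link works, since the stored target is absolute:
cat "$d/processed/asv_tables/study1/asv_table1.qza"   # prints "table"
```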