Use a common directory for a number of tasks

AiiDA uses a new working directory for each task, to my understanding. For reasons of I/O efficiency, I want multiple tasks in a work chain to put their intermediate files in a common directory. Is there an intended way of accomplishing this?

Hi @cbehren, :wave:

If I/O is your main concern, perhaps you should try symlinking. Let me write down some ideas below; let me know if I'm in the ballpark.

Fast/Hacky approach: the prepend_text

The fastest way I can think of to restart from the intermediate files of multiple calculations is to add a bunch of `ln -s` lines to the `metadata.options.prepend_text` of the follow-up `CalcJob` you are trying to run. For example:

```python
from pathlib import Path

prepend_lines = []

# cj1 and cj2 are the previously run CalcJobNodes whose files we want to reuse
for calcjob in (cj1, cj2):
    prepend_lines.extend(
        f"ln -s {Path(calcjob.get_remote_workdir()) / file} {file}" for file in ('file1', 'file2')
    )

inputs['metadata']['options']['prepend_text'] = '\n'.join(prepend_lines)
```

In this example, `file1` and `file2` will be symlinked from the working directories of the previously run `cj1` and `cj2`.

Of course, this doesn't respect the provenance very much, i.e. you wouldn't be able to see that these files came from the previous calculations.

Using a dynamic input namespace

One idea would be for the follow-up calculation to have a dynamic input namespace to which you can add multiple remote folders from the previous calculations. These could then be added to the `remote_symlink_list` of the `CalcInfo` instance returned at the end of `prepare_for_submission` of the final `CalcJob`. This feature unfortunately doesn't seem to be documented, but it works exactly like the `remote_copy_list`:

https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/calculations/usage.html#remote-copy-list

In this case the remote_folder outputs of the previous calculations would be nicely linked as inputs of the final one, and other AiiDA developers will nod in approval.

This one is a bit trickier to implement though, let me know if you need a hand.
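To sketch the idea: entries of `remote_symlink_list` are tuples of `(computer_uuid, source_absolute_path, destination_relative_path)`, just like for `remote_copy_list`. A minimal sketch of building such a list is below; the helper name and the idea of nesting each link under a per-calculation label are my own suggestions, not an established API:

```python
import os

def build_remote_symlink_list(computer_uuid, workdirs, filenames):
    """Build (computer_uuid, source_abs_path, dest_rel_path) triples for ``CalcInfo.remote_symlink_list``.

    ``workdirs`` maps a label (e.g. the key in the dynamic input namespace) to the
    absolute remote working directory of a previous calculation.
    """
    symlink_list = []
    for label, workdir in workdirs.items():
        for filename in filenames:
            source = os.path.join(workdir, filename)
            # Nest each link under its label to avoid filename clashes between calculations
            dest = os.path.join(label, filename)
            symlink_list.append((computer_uuid, source, dest))
    return symlink_list
```

Inside `prepare_for_submission` you would then assign the result to `calc_info.remote_symlink_list`, with the working directories obtained from the `RemoteData` nodes in the dynamic namespace.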


These two ideas don’t place the intermediate files in a common directory, but I’m not sure that’s necessary for your use case? Also note that using symlinking can have pitfalls in case the final calculation changes the file contents.

I am not 100% sure what you mean by I/O efficiency in this context.

There is currently no way in the API to control the working directory of calculations. It will always be generated based on the UUID of the node, and so will be unique for each calculation. This is intentional, though: if a task needs to operate on the output files of a previous calculation, those really need to be passed in as inputs to capture the provenance.
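As an aside, the shape of these automatic working directories can be seen in the paths that appear later in this thread (e.g. `/home/xyz/.aiida_run/30/8f/eb42-...`): the node UUID is sharded into two two-character prefixes plus the remainder. A sketch of that derivation, purely for illustration (the helper is not AiiDA API):

```python
import os

def remote_workdir(base_workdir, uuid):
    """Illustrate how a calculation's working directory is derived from its node UUID.

    The UUID is sharded as <first two chars>/<next two chars>/<rest>,
    e.g. '308feb42-...' becomes '30/8f/eb42-...'.
    """
    return os.path.join(base_workdir, uuid[:2], uuid[2:4], uuid[4:])
```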

Files can be stored (roughly) in three ways in AiiDA:

  • SinglefileData: a node that contains a single file
  • FolderData: a node that contains any number of files (optionally nested in directories)
  • RemoteData: A “symlink” to a directory on a remote computer. Here AiiDA just stores the computer and the directory, it doesn’t actually store which files it contains nor their content.

These nodes can be used as inputs to `CalcJob`s to make sure the files are available as input to the calculation. AiiDA supports symlinking files from a `RemoteData`, which avoids having to duplicate the files from the working directory of one calculation to the next.

aiida-shell makes this easy once again. Please have a look at this section in the documentation: How-to guides — aiida-shell 0.6.0 documentation. It shows how the working directory of a completed calculation can be passed to the next one, and you can tell it to use symlinks to prevent unnecessary copying.

As a quick example, if you have two tasks, of which the second needs the output files of the first, you could do something like the following:

```python
from aiida.engine import workfunction
from aiida_shell import launch_shell_job

@workfunction
def workflow():
    # Run the first command; its working directory is exposed as the remote_folder output
    results_01, node_01 = launch_shell_job(
        'command_1',
    )

    # Pass the first job's working directory to the second and symlink its contents
    results_02, node_02 = launch_shell_job(
        'command_2',
        nodes={
            'files_command_1': node_01.outputs.remote_folder
        },
        metadata={'options': {'use_symlinks': True}}
    )

    return results_02
```

Dear @mbercx and @sphuber,
Thanks for the input! The RemoteData approach or the remote_symlink_list sound like what I need, I will try that first.
By I/O efficiency, I really just meant that I want to avoid copying potentially large files or directories from job to job.
Originally I thought it would be a good idea to really have a shared directory, but I do see that I can achieve the same thing by symlinking from the previous jobs.

Dear @sphuber, I tried to implement your variant, but I run into the problem that the symlink option apparently tries to link the contents of task 1's directory directly into the directory of task 2:

```
FileExistsError: [Errno 17] File exists: '/home/xyz/.aiida_run/30/8f/eb42-4341-4b3d-9e13-752f1c9b1eae/_aiidasubmit.sh' -> '/home/xyz/.aiida_run/f5/99/00a2-3deb-4a02-87b4-bb3e3f53dd2f/./_aiidasubmit.sh'
```

In the latter directory, all the files from the previous job are symlinked directly; when it then also tries to copy `_aiidasubmit.sh`, it obviously crashes. I would have expected the files to be linked into a subfolder of the second job's working directory. Am I missing something?

No, your expectation is correct; this is just a bug in aiida-shell that was only recently uncovered. I added the symlink functionality recently and didn't think of this. Here is the issue: Bug: Default submit script is overridden by `RemoteData` · Issue #58 · sphuber/aiida-shell · GitHub

I have a fix for this in this PR: `ShellJob`: Detect and prevent filename clashes by sphuber · Pull Request #70 · sphuber/aiida-shell · GitHub

But that requires a change in aiida-core first, see this PR: `CalcJob`: Allow to define order of copying of input files by sphuber · Pull Request #6285 · aiidateam/aiida-core · GitHub

So if you check out the two branches of those repositories and install them, it should resolve the problem:

```shell
git clone https://github.com/sphuber/aiida-shell
cd aiida-shell
git checkout fix/068/input-output-filename-overlap
pip install -e .

cd ..
git clone https://github.com/sphuber/aiida-core
cd aiida-core
git checkout feature/6012/calcjob-file-copy-order
pip install -e .
```

It would be great if you could give that a try and let us know if it works. With the feedback, I could hopefully accelerate the review process and get the fixes merged and released. Thanks!

I applied both fixes, but I still get the same error message. Below is the expanded message:
```
Error: Exception whilst using transport:
Traceback (most recent call last):
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/processes/calcjobs/tasks.py", line 94, in do_upload
    remote_folder = execmanager.upload_calculation(node, transport, calc_info, folder)
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/daemon/execmanager.py", line 217, in upload_calculation
    _copy_remote_files(logger, node, computer, transport, remote_copy_list, remote_symlink_list)
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/daemon/execmanager.py", line 316, in _copy_remote_files
    transport.symlink(remote_abs_path, dest_rel_path)
  File "/home/xyz/test_aiida/aiida-core/src/aiida/transports/plugins/local.py", line 835, in symlink
    os.symlink(os.path.join(this_file), os.path.join(self.curdir, remotedestination, this_remote_dest))
FileExistsError: [Errno 17] File exists: '/home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/_aiidasubmit.sh' -> '/home/xyz/.aiida_run/87/96/e341-8453-441b-8939-398e6958e5bc/./_aiidasubmit.sh'
```

This is `ls -ahl` on the relevant directory. Note the symbolic links in place; it carried over e.g. the `status` file from the previous job:

```
drwxr-xr-x 3 xyz xyz 4,0K 13. Feb 22:40 .
drwxr-xr-x 3 xyz xyz 4,0K 13. Feb 22:40 ..
drwxr-x--- 2 xyz xyz 4,0K 13. Feb 22:40 .aiida
-rw-r--r-- 1 xyz xyz  446 13. Feb 22:40 _aiidasubmit.sh
lrwxrwxrwx 1 xyz xyz   86 13. Feb 22:40 _scheduler-stderr.txt -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/_scheduler-stderr.txt
lrwxrwxrwx 1 xyz xyz   71 13. Feb 22:40 status -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/status
lrwxrwxrwx 1 xyz xyz   71 13. Feb 22:40 stderr -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/stderr
lrwxrwxrwx 1 xyz xyz   71 13. Feb 22:40 stdout -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/stdout
```

Thanks for trying. I accidentally gave you the wrong branch; that one only contains part of the fix. The branch `feature/058/remote-folder-symlink` should fix this problem. However, I just realized that this is still not an ideal solution. Although it should now properly symlink all files from the first calculation, it essentially results in the original output files being overwritten. This might not immediately be a problem, since these files only live on scratch and the originals were saved permanently in AiiDA's repository, so they won't be affected. But it could become a problem further down the line, if the original `RemoteData` is reused once again for a follow-up command. That is the nature of symlinks, though.

Maybe I have to rethink the interface of the symlink functionality. Instead of just having a boolean flag that symlinks everything, the user would have to specify manually which files exactly should be symlinked. Do you have any thoughts on what a better interface could be?

Hey, the fix did something: I no longer get an error message, but the result looks different from what I expected.

The directory of the second task looks like this:

```
drwxr-x--- 2 xyz xyz 4.0K Feb 14 06:00 .aiida
lrwxrwxrwx 1 xyz xyz   80 Feb 14 06:00 _aiidasubmit.sh -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/_aiidasubmit.sh
lrwxrwxrwx 1 xyz xyz   86 Feb 14 06:00 _scheduler-stderr.txt -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/_scheduler-stderr.txt
lrwxrwxrwx 1 xyz xyz   86 Feb 14 06:00 _scheduler-stdout.txt -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/_scheduler-stdout.txt
lrwxrwxrwx 1 xyz xyz   71 Feb 14 06:00 status -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/status
lrwxrwxrwx 1 xyz xyz   71 Feb 14 06:00 stderr -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/stderr
lrwxrwxrwx 1 xyz xyz   71 Feb 14 06:00 stdout -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/stdout
lrwxrwxrwx 1 xyz xyz   76 Feb 14 06:00 myfiles -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/myfiles
```

So it has now linked all files to the previous job, including the `_aiidasubmit.sh` file, and it has overwritten the `_aiidasubmit.sh` of the previous job. What it has not linked is the `.venv` folder produced by task 1, but this might be intentional (note the dot at the beginning of the name; I guess you ignore those, which is fine. It does show up when I rename `.venv` to `venv`, removing the dot). The `myfiles` directory was generated by the second job. What I intended was something like:

```
<UUID of Job 2>/
    myfiles/
    job1/
        venv/
```
where `job1/` is a symbolic link to the working directory of job 1. For me, the ideal way of handling this would be to be able to specify the directory I want to link to, and a name for the link.
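Until an interface like that exists, one workaround in the spirit of the `prepend_text` idea from earlier in the thread would be to generate a single symlink to the whole previous working directory under a chosen name. A minimal sketch; the helper name is mine and not part of any API:

```python
def link_workdir_line(workdir, link_name):
    """Return a shell line that symlinks a previous job's working directory under link_name."""
    return f"ln -s {workdir} {link_name}"

# Hypothetical usage, where cj1 is the CalcJobNode of the first job:
# inputs['metadata']['options']['prepend_text'] = link_workdir_line(cj1.get_remote_workdir(), 'job1')
```

This gives the `job1/` layout sketched above without touching the second job's own files, at the usual cost of symlinks: changes through the link affect the original directory.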

Maybe it is helpful to explain what I want to do. The pipeline I am working on contains about 10 steps, only one of which is computationally heavy. In the process, about six git repos have to be cloned; some of them contain data, others contain scripts that are used. I need to install them in a clean virtual environment, but for performance reasons I would like to install all of them into the same virtual environment. So I basically want to carry the virtual env from task to task without copying, and the same goes for the cloned git repos. I should add that I also would not store the venv or the git repos in the database: the git repos are characterized by their commit hashes anyway, and I want to store only those in the database.

However, going through your fix I just realized that aiida-shell automatically replaces the shell command with its absolute path in the startup environment of the computer, which means that I will not be able to use aiida-shell to utilize said virtual environment. I would need to tweak this here to get around this issue, right? I could add an option to `launch_shell_job` like `defer_resolution=True` to skip that `which {command}`. Then I could pass something like `python` as the command, and the `_aiidasubmit.sh` would contain just that command with no absolute path.

That is indeed useful. I am wondering whether you shouldn't actually store the virtual environment, if reproducibility is crucial. Even if you install from some repo with the commit hash stored as an input, if it has any dependencies that are not pinned to a particular version as well, running the same install in the future might produce a different env.

What I am thinking of is to first run a step that creates the env and then stores it in AiiDA as a FolderData. You can then use this as an input to each next step and it will be copied to the working directory. This guarantees perfect reproducibility, as you can always recover the exact virtual env in which a step was run. The downside is that you would be storing quite a bit of data and copying it to the scratch space for each step.

Manually defining the working directory of a `CalcJob` is currently not supported in aiida-core. The working directory is generated automatically based on the work dir of the computer and the UUID of the `CalcJobNode`, see here:

The reason for not allowing a shared working directory is at least partly the same observation we made here: certain default input/output files can overwrite one another.

> I would need to tweak this here to get around this issue, right? I could add an option to `launch_shell_job` like `defer_resolution=True` to skip that `which {command}`. Then I could pass something like `python` as the command, and the `_aiidasubmit.sh` would contain just that command with no absolute path.

This is something we can definitely add to aiida-shell. I think it would be necessary to support running commands inside a virtual environment. I will work on a PR today.

I will come back to this later, just a quick remark on this part:

This should work; I just commented out the relevant part in `prepare_code` for now and passed on `command` instead of `executable`. The only other thing is that somewhere deeper down, e.g. `python` is still going to be replaced by whatever `which` gives (I did not look up where exactly!). To make sure `python` is run from the virtual env, one has to provide the command as `venv/bin/python`.

An alternative is to just specify `python` and then add the command

```shell
source venv/bin/activate
```

to the `metadata.options.prepend_text` input. This command is executed before the main executable is called, so it first loads the env, after which `python` points to the Python version installed in the virtual env.

https://aiida-shell.readthedocs.io/en/latest/howto.html#customizing-run-environment
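As a small sketch of what that could look like in practice with `launch_shell_job` (the script name and command are placeholders; `prepend_text` is the option from the how-to linked above, and `.` is used instead of `source` for shells that lack the latter, as came up later in this thread):

```python
# Metadata for a shell job that runs inside a pre-existing virtual env on the remote.
# The POSIX '.' builtin is used instead of 'source', which not all shells provide.
metadata = {
    'options': {
        'prepend_text': '. venv/bin/activate',
    }
}

# Hypothetical usage with aiida-shell:
# from aiida_shell import launch_shell_job
# results, node = launch_shell_job('python', arguments=['script.py'], metadata=metadata)
```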

Edit: I created a PR with an option to skip resolving the command: `launch_shell_job`: Add option to keep skip resolving of `command` by sphuber · Pull Request #73 · sphuber/aiida-shell · GitHub

You are right, this would be better, although it would require storing quite a large amount of data. I understand that the working directories are not changeable right now. My naive assumption was that there is some way to point to a shared folder within the framework of AiiDA (outside of the working directories), but I see this might be difficult.

Sorry, `source` does not work in my shell (but as it turned out, `.` works instead), so I missed that until now. Thanks.

Looks great, I will take a look at it.

This works fine for me!