To my understanding, AiiDA uses a new working directory for each task. For reasons of I/O efficiency, I want multiple tasks in a work chain to put intermediate files in a common directory. Is there an intended way of accomplishing this?
Hi @cbehren,
If I/O is your main concern, perhaps you should try to use symlinking. Let me write down some ideas below, let me know if I’m in the ballpark.
Fast/hacky approach: the `prepend_text`
The fastest way I can think of to achieve a restart from intermediate files from multiple calculations is to add a bunch of `ln` lines to the `metadata.options.prepend_text` of the follow-up `CalcJob` you are trying to run. For example:
```python
from pathlib import Path

prepend_lines = []
for calcjob in (cj1, cj2):
    prepend_lines.extend(
        [f"ln -s {Path(calcjob.get_remote_workdir()) / file} {file}" for file in ('file1', 'file2')]
    )
inputs['metadata']['options']['prepend_text'] = '\n'.join(prepend_lines)
```
In this example `file1` and `file2` will be symlinked from the previously run `cj1` and `cj2`.
Of course, this doesn't respect the provenance very much, i.e. you wouldn't be able to see that these files came from the previous calculations.
Using a dynamic input namespace
One idea would be for the follow-up calculation to have a dynamic input namespace to which you can add multiple remote folders from the previous calculations. These could then be added to the `remote_symlink_list` of the `CalcInfo` instance returned at the end of the `prepare_for_submission` of the final `CalcJob`. This feature unfortunately doesn't seem to be documented, but it works exactly like the `remote_copy_list`:
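To make the idea concrete, here is a rough sketch of the bookkeeping involved. Entries of `CalcInfo.remote_symlink_list` are tuples of `(computer_uuid, source_absolute_path, destination_relative_path)`; the helper and the `remote_folders` label below are hypothetical names for illustration, not part of any existing plugin:

```python
# Illustrative sketch only: build remote_symlink_list entries of the form
# (computer_uuid, source_absolute_path, destination_relative_path), as a
# CalcJob's prepare_for_submission could do for a dynamic input namespace
# of RemoteData nodes. All names here are hypothetical.
from pathlib import PurePosixPath

def build_symlink_list(remote_folders, filenames):
    """remote_folders maps an input label to a (computer_uuid, workdir) pair."""
    symlink_list = []
    for label, (computer_uuid, workdir) in remote_folders.items():
        for filename in filenames:
            source = str(PurePosixPath(workdir) / filename)
            # Place each link under a subdirectory named after the input label,
            # so files from different previous calculations cannot clash.
            dest = str(PurePosixPath(label) / filename)
            symlink_list.append((computer_uuid, source, dest))
    return symlink_list
```

In an actual `CalcJob` these tuples would be assigned to `calc_info.remote_symlink_list`, with the `(computer_uuid, workdir)` pairs presumably taken from the `RemoteData` inputs via `remote_data.computer.uuid` and `remote_data.get_remote_path()`.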
In this case the `remote_folder` outputs of the previous calculations would be nicely linked as inputs of the final one, and other AiiDA developers will nod in approval.
This one is a bit trickier to implement though, let me know if you need a hand.
These two ideas don’t place the intermediate files in a common directory, but I’m not sure that’s necessary for your use case? Also note that using symlinking can have pitfalls in case the final calculation changes the file contents.
I am not 100% sure what you mean by I/O efficiency in this context.
There is currently no way in the API to control the working directory of calculations. They will always be generated based on the UUID of the node, and so will be unique for each calculation. This is intentional though. If a task needs to operate on the output files of a previous task, then those really need to be passed in as inputs to capture the provenance.
Files can be stored (roughly) in three ways in AiiDA:

- `SinglefileData`: a node that contains a single file
- `FolderData`: a node that contains any number of files (optionally nested in directories)
- `RemoteData`: a "symlink" to a directory on a remote computer. Here AiiDA just stores the computer and the directory; it doesn't actually store which files it contains, nor their content.
These nodes can be used as inputs to `CalcJob`s to make sure the files are available as input to the calculation. AiiDA supports symlinking files from `RemoteData`, which prevents having to duplicate the files from the working directory of one calculation to the next.
`aiida-shell` makes this easy once again. Please have a look at this section in the documentation: How-to guides — aiida-shell 0.6.0 documentation. It shows how the working directory of a completed calculation can be passed to the next, and you can tell it to use symlinks in order to prevent unnecessary copying.
As a quick example, if you have two tasks, of which the second needs the output files of the first, you could do something like the following:
```python
from aiida.engine import workfunction
from aiida_shell import launch_shell_job

@workfunction
def workflow():
    results_01, node_01 = launch_shell_job(
        'command_1',
    )
    results_02, node_02 = launch_shell_job(
        'command_2',
        nodes={
            'files_command_1': node_01.outputs.remote_folder
        },
        metadata={'options': {'use_symlinks': True}}
    )
```
Dear @mbercx and @sphuber,
Thanks for the input! The `RemoteData` approach or the `remote_symlink_list` sound like what I need, I will try that first.
By I/O efficiency, I really just meant that I want to avoid copying potentially large files or directories from job to job.
Originally, I thought it would be a good idea to really have a shared directory, but I do see that I can arrive at the same thing by symlinking the previous jobs.
Dear @sphuber, I tried to implement your variant, but I get the problem that, apparently, the symlink option tries to link the dir of task 1 directly into the dir of task 2:
```
FileExistsError: [Errno 17] File exists: '/home/xyz/.aiida_run/30/8f/eb42-4341-4b3d-9e13-752f1c9b1eae/_aiidasubmit.sh' -> '/home/xyz/.aiida_run/f5/99/00a2-3deb-4a02-87b4-bb3e3f53dd2f/./_aiidasubmit.sh'
```
In the latter directory, all the files from the previous jobs are directly symlinked; when it then also tries to copy the `_aiidasubmit.sh`, it obviously crashes. I would have expected the files to be linked into a subfolder of the second job's output dir. Am I missing something?
No, your expectation is correct, this is just a bug in `aiida-shell` that was recently uncovered. I only added the symlink functionality recently and didn't think of this. Here is the issue: Bug: Default submit script is overridden by `RemoteData` · Issue #58 · sphuber/aiida-shell · GitHub

I have a fix for this in this PR: `ShellJob`: Detect and prevent filename clashes by sphuber · Pull Request #70 · sphuber/aiida-shell · GitHub

But that requires a change in `aiida-core` first, see this PR: `CalcJob`: Allow to define order of copying of input files by sphuber · Pull Request #6285 · aiidateam/aiida-core · GitHub
So if you check out the two branches of those repositories and install them, it should resolve the problem:
```shell
git clone https://github.com/sphuber/aiida-shell
cd aiida-shell
git checkout fix/068/input-output-filename-overlap
pip install -e .
cd ../
git clone https://github.com/sphuber/aiida-core
cd aiida-core
git checkout feature/6012/calcjob-file-copy-order
pip install -e .
```
It would be great if you could give that a try and let us know if it works. With the feedback, I could hopefully accelerate the review process and get the fixes merged and released. Thanks!
I applied both fixes, but I still get the same error message. Below the expanded message:
```
Error: Exception whilst using transport:
Traceback (most recent call last):
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/processes/calcjobs/tasks.py", line 94, in do_upload
    remote_folder = execmanager.upload_calculation(node, transport, calc_info, folder)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/daemon/execmanager.py", line 217, in upload_calculation
    _copy_remote_files(logger, node, computer, transport, remote_copy_list, remote_symlink_list)
  File "/home/xyz/test_aiida/aiida-core/src/aiida/engine/daemon/execmanager.py", line 316, in _copy_remote_files
    transport.symlink(remote_abs_path, dest_rel_path)
  File "/home/xyz/test_aiida/aiida-core/src/aiida/transports/plugins/local.py", line 835, in symlink
    os.symlink(os.path.join(this_file), os.path.join(self.curdir, remotedestination, this_remote_dest))
FileExistsError: [Errno 17] File exists: '/home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/_aiidasubmit.sh' -> '/home/xyz/.aiida_run/87/96/e341-8453-441b-8939-398e6958e5bc/./_aiidasubmit.sh'
```
This is `ls -ahl` on the relevant dir. Note the symbolic links in place. It copied over e.g. the status from the previous job:
```
drwxr-xr-x 3 xyz xyz 4,0K 13. Feb 22:40 .
drwxr-xr-x 3 xyz xyz 4,0K 13. Feb 22:40 ..
drwxr-x--- 2 xyz xyz 4,0K 13. Feb 22:40 .aiida
-rw-r--r-- 1 xyz xyz  446 13. Feb 22:40 _aiidasubmit.sh
lrwxrwxrwx 1 xyz xyz   86 13. Feb 22:40 _scheduler-stderr.txt -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/_scheduler-stderr.txt
lrwxrwxrwx 1 xyz xyz   71 13. Feb 22:40 status -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/status
lrwxrwxrwx 1 xyz xyz   71 13. Feb 22:40 stderr -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/stderr
lrwxrwxrwx 1 xyz xyz   71 13. Feb 22:40 stdout -> /home/xyz/.aiida_run/e6/6e/dd03-c81d-4897-bcef-edd79ee0a54f/stdout
```
Thanks for trying. I accidentally gave you the wrong branch; that one only contains part of the fix. The branch `feature/058/remote-folder-symlink` should fix this problem. However, I just realized that this is still not an ideal solution. Although it should now properly symlink all files from the first calculation, it essentially results in the original output files being overwritten. This might not immediately be a problem, since these files are just on scratch and the originals were saved permanently in AiiDA's repository, so they won't be affected. But maybe this could be a problem further down the line, if the original `RemoteData` is reused once again for a follow-up command. That is the nature of symlinks though.
Maybe I have to rethink the interface of the symlink functionality. Instead of just having a boolean flag that symlinks everything, the user would have to specify manually which files exactly should be symlinked. Do you have any thoughts on what a better interface could be?
Hey, the fix did something: I did not get an error message, but the result looks different from what I expected.
The directory of the second task looks like this:
```
drwxr-x--- 2 xyz xyz 4.0K Feb 14 06:00 .aiida
lrwxrwxrwx 1 xyz xyz   80 Feb 14 06:00 _aiidasubmit.sh -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/_aiidasubmit.sh
lrwxrwxrwx 1 xyz xyz   86 Feb 14 06:00 _scheduler-stderr.txt -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/_scheduler-stderr.txt
lrwxrwxrwx 1 xyz xyz   86 Feb 14 06:00 _scheduler-stdout.txt -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/_scheduler-stdout.txt
lrwxrwxrwx 1 xyz xyz   71 Feb 14 06:00 status -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/status
lrwxrwxrwx 1 xyz xyz   71 Feb 14 06:00 stderr -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/stderr
lrwxrwxrwx 1 xyz xyz   71 Feb 14 06:00 stdout -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/stdout
lrwxrwxrwx 1 xyz xyz   76 Feb 14 06:00 myfiles -> /home/xyz/.aiida_run/4e/85/72a7-065f-4486-8632-8f1c72f2e0f0/myfiles
```
So it has now linked all files to the previous job, including the `_aiidasubmit.sh` file, and it has overwritten the `_aiidasubmit.sh` of the previous job. What it has not linked is the `.venv` folder that is a result of task 1, but this might be intentional (note the dot at the beginning of the name; I guess you ignore those, which is fine. The link shows up when I replace `.venv` by `venv`, removing the dot). The dir `myfiles` was generated by the second job. What I intended was something like:
```
<UUID of Job 2>/
    myfiles/
    job1/
    job1/venv/
```

where `job1/` is a symbolic link to the working dir of job 1. For me, the ideal way of handling this would be to be able to specify the directory I want to link to, and a name for the link.
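For what it's worth, a layout like this could be approximated today by hand via the hacky `prepend_text` route from earlier in the thread: link each previous working directory as a whole under a chosen name, rather than linking its individual files. A small sketch (the helper name and example path are made up; this is not an existing aiida-shell feature):

```python
# Hypothetical helper: build a prepend_text that symlinks each previous job's
# whole working directory under a named subdirectory (e.g. `job1/`), so its
# contents do not get flattened into the current working directory.
import shlex

def named_workdir_links(workdirs):
    """workdirs maps a link name to an absolute remote working directory."""
    return '\n'.join(
        f'ln -s {shlex.quote(path)} {shlex.quote(name)}'
        for name, path in workdirs.items()
    )

text = named_workdir_links({'job1': '/home/xyz/.aiida_run/4e/85/workdir'})
# text == 'ln -s /home/xyz/.aiida_run/4e/85/workdir job1'
```

The result would then go into `metadata.options.prepend_text`, with the caveat from earlier that this bypasses the provenance graph.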
Maybe it is helpful to explain what I want to do. The pipeline I am working on contains about 10 steps, and only one of them is computationally heavy. In the process, ~6 git repos have to be cloned; some of them contain data, others contain scripts that are used. I need to install them in a clean virtual environment, but for performance reasons, I would like to install all of them in the same virtual environment. So I basically want to carry the virtual env from task to task without copying. The same goes for the cloned git repos. I should add that I also would not store the venv or the git repos in the database; the git repos should be characterized by their commit hashes anyway, and I want to store only those in the database.
However, going through your fix I just realized that aiida-shell will automatically replace the shell command with its absolute path in the startup environment of the computer, which means that I will not be able to use aiida-shell to utilize said virtual environment. I would need to tweak this here to get around this issue, right? I could add an option to `launch_shell_job` like `defer_resolution=True` to skip over that `which {command}`. Then I could put something like `python` as the command, and the `_aiidasubmit.sh` would contain just that command with no absolute path.
That is indeed useful. I am wondering if you actually shouldn't store the virtual environment, if reproducibility is crucial. Even if you install from some repo and the commit hash is stored as an input, if it has any dependencies that are not all pinned to a particular version as well, running the same install in the future might actually produce a different env.
What I am thinking of is to first run a step that creates the env and then stores it in AiiDA using a `FolderData`. You can then use this as an input to each next step, and it will be copied to the working directory. This guarantees perfect reproducibility, as you can always recover the exact virtual env in which a step was run. The downside is that you would be storing quite a bit of data and copying it to the scratch space for each step.
Manually defining the working directory for a `CalcJob` is currently not supported in `aiida-core`. The working directory is generated automatically based on the work dir of the computer and the UUID of the `CalcJobNode`, see here:
The reason for not allowing the same working directory is based at least on the same observation we made here, where certain default input/output files can overwrite one another.
> I would need to tweak this here to get around this issue, right? I could add an option to `launch_shell_job` like `defer_resolution=True` to skip over that `which {command}`. Then, I could put something like `python` as command, and the `_aiidasubmit.sh` would contain just that command with no absolute path.
This is something we can definitely add to `aiida-shell`. I think this would be necessary to support running commands inside a virtual environment. I will work on a PR today.
I will come back to this later, just a quick remark on this part:
This should work. I just commented out the part in `prepare_code` for now, and passed on `command` instead of `executable`. The only other thing is that somewhere deeper down, e.g. `python` is still going to be replaced by whatever `which` gives (I did not look up where exactly!). To make sure `python` is run from the virtual env, one has to provide the command as `venv/bin/python`.
An alternative is just to specify `python` and then add the command `source venv/bin/activate` to the `metadata.options.prepend_text` input. This command will be executed before the main executable is called, so it first loads the env, and then `python` will point to the Python version installed in the virtual env.
https://aiida-shell.readthedocs.io/en/latest/howto.html#customizing-run-environment
Edit: I created a PR with an option to skip resolving the command: `launch_shell_job`: Add option to keep skip resolving of `command` by sphuber · Pull Request #73 · sphuber/aiida-shell · GitHub
You are right, this would be better, although it would require storing quite a large amount of data. I understand that the working directories are not changeable right now. My naive assumption was that there is some way to point to a shared folder within the framework of AiiDA (outside of the working directories), but I see this might be difficult.
Sorry, `source` does not work in my shell (but as it turned out, `.` works instead), so I missed that until now. Thanks.
Looks great, I will take a look at it.
This works fine for me!