Implementing the Flux scheduler from LLNL

Hello everyone,

I am looking to possibly start implementing a scheduler plugin for the Flux scheduler that has been developed here at LLNL. This will be very useful as the next generation of supercomputers at the lab will only be running Flux.

There are some unique capabilities within Flux that I would like to take advantage of as I develop this, but I’m not sure how well it fits into the AiiDA framework/thought process. Flux works like a normal scheduler, let’s say Slurm, where you can submit individual jobs to the queue. On top of this, you can request a Flux instance which lets you set up your personal queue(s) as needed for your tasks. Imagine being able to keep a Slurm allocation open and submit the next step of your workflow without having to wait for a new job to be submitted. There are some workflows at the lab that already take advantage of this and the throughput is very impressive. One example of this is Merlin.

Has something similar to this been done? Are there any pitfalls that can be thought of or is this not feasible at all? I’d appreciate any and all feedback on this. Thanks.

Nathan Keilbart

Hi Nathan,

Thanks a lot for getting in touch. It seems quite feasible to me.

Having a look at the abstract class for scheduler plugins, I think a Flux plugin can be developed by implementing only four abstract methods:

  • get_submit_script_header: basically defines the generic structure of the submission script
  • submit_job: command to submit a job; this function should receive and return a job ID → from what I read here, Launching and tracking Flux jobs | LLNL HPC Tutorials, apparently Flux also returns something similar.
  • get_jobs: command to get the status of one (or a list of) job(s). It should update the status of the AiiDA process as Queued, Running, Finished, etc.
  • kill_job: command to kill a job. The input value is a job ID; it returns True or False.

So implementing the Flux interface should be rather straightforward (assuming that you guys use SSH; if that’s not the case, then one would also have to develop a transport plugin, and that’s more work…).
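Just to illustrate the structure, here is a rough, untested skeleton of what such a plugin could look like. The method names follow the list above, but the exact signatures (and whether they carry a leading underscore) should be checked against the Scheduler base class of the aiida-core version you use, and the flux commands in the comments are only my guess at the relevant CLI calls:

from aiida.schedulers import Scheduler

class FluxScheduler(Scheduler):
    """Sketch of a scheduler plugin for LLNL's Flux (hypothetical, untested)."""

    def get_submit_script_header(self, job_tmpl):
        # translate the JobTemplate (resources, walltime, queue, ...) into Flux directives
        ...

    def submit_job(self, working_directory, filename):
        # e.g. run `flux batch <filename>` through the transport and return the job ID parsed from stdout
        ...

    def get_jobs(self, jobs=None, user=None, as_dict=False):
        # e.g. run `flux jobs` and map the Flux job states onto AiiDA's JobState values
        ...

    def kill_job(self, jobid):
        # e.g. run `flux cancel <jobid>` and return True on success
        ...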

And about your main question regarding creating a new scheduler queue per workflow and submitting to it, I guess one would develop two extra methods for the scheduler plugin (a rough sketch follows below the list):

  • e.g. make_my_queue: command to create a Q in HPC
  • e.g. delete_my_queue: command to remove a Q from HPC
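These two methods don’t exist in the Scheduler base class, so the following is only a hypothetical sketch, assuming they simply run a (to-be-determined) Flux command through the computer’s transport — whether that means creating a named queue or starting/stopping a nested Flux instance is up to you:

from aiida.schedulers import Scheduler

class FluxScheduler(Scheduler):
    # ... plus the four methods sketched above ...

    def make_my_queue(self, queue_name):
        # hypothetical: run whatever Flux command creates the queue (or starts a nested instance)
        retval, stdout, stderr = self.transport.exec_command_wait(f'<flux command to create {queue_name}>')
        return retval == 0

    def delete_my_queue(self, queue_name):
        # hypothetical: run the corresponding tear-down command
        retval, stdout, stderr = self.transport.exec_command_wait(f'<flux command to remove {queue_name}>')
        return retval == 0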

This way, you’ll need to do some extra steps in your AiiDA workflows, for example:

from aiida import orm
from aiida.engine import submit
from aiida.calculations.arithmetic.add import ArithmeticAddCalculation

computer = orm.load_computer(<COMPUTER NAME>)
scheduler = computer.get_scheduler()
# here you create the Q on the remote
scheduler.make_my_queue(<Q_name>)

# and then in your workflow, specify <Q_name>,
# e.g. the analogue of #SBATCH --partition=<Q_name> in Slurm
builder = ArithmeticAddCalculation.get_builder()
builder.code = orm.load_code(<CODE_NAME>)
builder.x = orm.Int(1)  # example inputs
builder.y = orm.Int(2)
builder.metadata.options.custom_scheduler_commands = <LLNL_Q_COMMAND>
submit(builder)

# here, you eliminate the Q, if you want.
scheduler.delete_my_queue(<Q_name>)

Please don’t hesitate to ask more questions if unclear or if you need support developing it!


Thanks for the quick response. I had to make sure I’d have the time to work on this and it looks like that is the case. I’ll reach back out as I have questions.


Hi @nkeilbart, when I was trying to bring a local scheduler to AiiDAlab, I checked both Flux and HyperQueue.

I think those two fit the same goal of having a lightweight scheduler on your local machine. I finally decided to use HyperQueue because it allows requesting a more fine-grained CPU allocation, which is our requirement for the AiiDAlab container that may run with a non-integer number of cores allocated.

You can try aiida-hyperqueue to see if it fits your goal.
Here is how I set up an aiida computer (localhost) for hyperqueue scheduler: aiidalab-qe/before-notebook.d/42_setup-hq-computer.sh at main · aiidalab/aiidalab-qe · GitHub
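(For reference, a rough Python-API equivalent of that setup script could look like the snippet below; this assumes the aiida-hyperqueue plugin registers the hyperqueue scheduler entry point and that everything runs on the same machine via the core.local transport — the label and work directory are just placeholders.)

from aiida import load_profile, orm

load_profile()

computer = orm.Computer(
    label='localhost-hq',          # hypothetical label
    hostname='localhost',
    transport_type='core.local',
    scheduler_type='hyperqueue',   # entry point assumed from the aiida-hyperqueue plugin
    workdir='/tmp/aiida_run',      # placeholder work directory
).store()
computer.configure()               # use the default transport configuration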

Here is how I start the HyperQueue service to submit the job: aiidalab-qe/before-notebook.d/43_start-hq.sh at main · aiidalab/aiidalab-qe · GitHub (basically, start the server and then start a worker with a given number of cores)

But for sure, feel free to try to implement a Flux scheduler plugin.