How to Reset/fix FromGroupSubmissionController

Hi all,

I have been working with FromGroupSubmissionController from aiida-submission-controller and have run into an issue: it reports that there are no available slots. However, my limit is 100, and only 6 processes from the group that the controller uses are showing in verdi process list.

Is there a way I can reset the controller/force it to update or recheck the active processes, or otherwise fix this issue?

I’m using the SmilesToGaussianController from my aiida-aimall package:

from aiida import orm
from aiida_aimall.controllers import SmilesToGaussianController

smile_controller = SmilesToGaussianController(
    parent_group_label = 'smiles', # A group of orm.Str for molecule SMILES
    group_label = 'g16_opt', # group to put Workchain nodes into
    code_label='gaussian@cedar',
    g16_opt_params=orm.Dict(dict={
        'link0_parameters': {
            '%chk':'aiida.chk',
            "%mem": "2000MB",
            "%nprocshared": 4,
        },
        'functional':'wb97xd',
        'basis_set':'aug-cc-pvtz',
        'route_parameters': {'opt': None, 'freq':None,'Output':'WFX'},
        "input_parameters": {"output.wfx\n": None, "output2.wfx":None},
    }),
    wfxgroup = "opt_wfx",
    nprocs = 4,
    mem_mb = 3200,
    time_s = 60*60*24*7,
    max_concurrent = 100
)

smile_controller.num_available_slots

prints 0.

Again, only 6 processes are showing in verdi process list, and the limit should be 100.

Any help would be appreciated, thanks!

Update: working through this, I found all the processes that the controller thinks are active by using its get_query method. However, the processes it returned have process_state ‘excepted’, so I still need to find a way to get them out of what the controller counts as active processes.

Hi @kmlefran!

The number of available slots is always updated/recalculated when you call SubmissionController.num_available_slots or SubmissionController.submit_new_batch(). Moreover, excepted processes should be sealed and therefore ignored in the get_query(only_active=True) call. Did you specify only_active=True when you checked the get_query() method? This is relevant because get_query() would otherwise list all submitted processes, which would be correct in that scenario.
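To illustrate the slot arithmetic, here is a minimal sketch in plain Python. It is not the actual aiida-submission-controller code: the function name and the dictionaries standing in for process nodes are assumptions made for illustration. It only shows why processes that are excepted but not sealed would still occupy slots.

```python
def count_available_slots(max_concurrent, processes):
    """Return the free slots: the limit minus the processes still counted as active.

    Each process is a dict with a boolean 'sealed' key, a hypothetical
    stand-in for a ProcessNode. A missing key is treated as "not sealed".
    """
    active = sum(1 for proc in processes if not proc.get("sealed", False))
    # Never report a negative number of slots.
    return max(max_concurrent - active, 0)


# 100 processes that excepted but were never sealed still fill every slot.
stuck = [{"process_state": "excepted", "sealed": False} for _ in range(100)]
print(count_available_slots(100, stuck))
```

Under this model, the slots only free up once the stuck processes are sealed or no longer belong to the group the controller queries.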

It would be good to get more details. How many SMILES, i.e. Str nodes, are in the initial group, and how many processes are already in the final group? In other words, how many jobs do you actually expect still need to be submitted?

Another detail that I noticed in your implementation:
Does the following actually work? (aiida-aimall/aiida_aimall/controllers.py at main · kmlefran/aiida-aimall · GitHub)

return inputs, WorkflowFactory(self.WORKFLOW_ENTRY_POINT)

The returned tuple is passed to aiida.engine.submit in the BaseSubmissionController which expects the opposite order in principle. I made a small test and in that case it actually failed.

Happy to investigate this further once you provide more information.

Hi,

I did include only_active=True in the get_query call, and it returned the expected 100 processes. The processes were excepted but I guess they weren’t sealed for some reason? Not sure why.

What I’ve done for now is remove those nodes from the destination group to free up the active slots. The initial group has ~27000 Str nodes, and to date ~2300 have been run using the controller. As for the return order: yes, it actually does work! That may be because I haven’t updated aiida-submission-controller. Looking at the old version of the add_in_batches example, before an update 2 months ago, I see:

        code = orm.load_code(self.code_label)
        inputs = {
            "code": code,
            "x": orm.Int(extras_values[0]),
            "y": orm.Int(extras_values[1]),
        }
        return inputs, CalculationFactory(code.get_input_plugin_name())

Which has the same order as mine. I drew pretty heavily from the examples almost a year ago when I was originally writing these, so that makes sense. I guess I’ll update my installation of the aiida-submission-controller and update the code as well to match the newer version. I don’t think that’s related to the issue I’m having here.

Just seems like the issue here is that my processes weren’t sealed despite being excepted for some reason. Probably unrelated to aiida-submission-controller itself I imagine. Anyone have any idea why that might be?

Quick note that my aiida-submission-controller is the same as the latest PyPI release. The changes with the updated submit_new_batch, which gets a builder from get_inputs_and_processclass_from_extras, haven’t been released there yet, so that’s why I’m not getting an error: I installed using pip.

I can still install from the repository and update my stuff anyways, but just making a note of that.

Hi @kmlefran

Yes, you are right. I forgot about that update, and it makes total sense that your approach is working. I also didn’t expect it to be related to the issue you encountered; it was only a general comment out of curiosity.

It’s indeed strange that these processes are excepted but not sealed. You can run node.is_sealed (where node is your excepted ProcessNode) to confirm that they were indeed not sealed. However, it seems that this has to be the reason based on your observations and having in mind that your implementation already worked for a large number of submissions.
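As a rough sketch of that check, the helper below tallies nodes by process state and sealed flag. It relies only on the process_state and is_sealed properties of the nodes; the function name is made up for illustration, and the way you collect the nodes (for example from the destination group) is left to you.

```python
from collections import Counter


def sealed_summary(nodes):
    """Count nodes per (process_state, is_sealed) pair.

    Works on any objects exposing `process_state` and `is_sealed`,
    which is what this sketch assumes about the process nodes.
    """
    return Counter((str(node.process_state), bool(node.is_sealed)) for node in nodes)
```

Any entry that pairs an excepted state with is_sealed False would confirm the inconsistency.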

Pinging @mbercx and @sphuber: am I missing something about excepted processes not being sealed, or is there a known problem concerning this? (At least I didn’t find any previous issue/PR on aiida-core.)

An excepted node should be sealed. So @kmlefran if you do in fact have excepted processes that are not sealed, that should be considered a bug. It would be great if you could come up with a reproducible example so I could have a look at how to fix it.

Yeah, a query is showing that this is the case for me:

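from aiida.orm import Node, QueryBuilder

# Find nodes that are excepted but either unsealed or missing the 'sealed' attribute
qb = QueryBuilder()
qb.append(
    Node,
    filters={
        "or": [
            {"attributes.sealed": False},
            {"attributes": {"!has_key": "sealed"}},
        ],
        "attributes.process_state": "excepted",
    },
)
qb.all()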

This returns some nodes. I don’t have anything to reproduce it reliably yet.
For some of these processes, the exception was caused by the daemon using the wrong Python environment and trying to use a function that wasn’t available in that environment.

Ok, that would indeed be an anomalous situation. For the time being, you can iterate over those inconsistent nodes and call node.seal() to seal them manually. I think that should work. At least that way they won’t interfere with your submission controller.
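A minimal sketch of that loop, assuming node.seal() behaves as expected on these excepted nodes (the helper name is just for illustration):

```python
def seal_unsealed(nodes):
    """Seal every node that is not yet sealed and return how many were sealed.

    Only uses `is_sealed` and `seal()`; nodes that are already sealed
    are skipped, so the helper is safe to run more than once.
    """
    sealed_now = 0
    for node in nodes:
        if not node.is_sealed:
            node.seal()
            sealed_now += 1
    return sealed_now
```

With the query you posted, something like seal_unsealed(qb.all(flat=True)) in a verdi shell would be one way to apply it. It is probably worth trying it on a single node first.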

Also, if you still have the traceback or can find it in the logs, it might help to see where the exception was raised and why the sealing was not triggered.

Here’s the process report

2024-06-25 17:05:58 [6930 | REPORT]: [80542|SmilesToGaussianWorkchain|on_except]: Traceback (most recent call last):
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/plumpy/processes.py", line 185, in spec
    return cls.__getattribute__(cls, '_spec')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Protect' object has no attribute '_spec'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/plumpy/base/state_machine.py", line 324, in transition_to
    self._enter_next_state(new_state)
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/plumpy/base/state_machine.py", line 388, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/plumpy/base/state_machine.py", line 300, in _fire_state_event
    callback(self, hook, state)
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/plumpy/processes.py", line 334, in <lambda>
    lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/aiida/engine/processes/process.py", line 428, in on_entered
    self.update_outputs()
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/aiida/engine/processes/process.py", line 665, in update_outputs
    outputs_flat = self._flat_outputs()
                   ^^^^^^^^^^^^^^^^^^^^
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/aiida/engine/processes/process.py", line 851, in _flat_outputs
    return dict(self._flatten_outputs(self.spec().outputs, self.outputs))
                                      ^^^^^^^^^^^
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/aiida/engine/processes/workchains/workchain.py", line 135, in spec
    return super().spec()  # type: ignore[return-value]
           ^^^^^^^^^^^^^^
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/aiida/engine/processes/process.py", line 88, in spec
    return super().spec()  # type: ignore[return-value]
           ^^^^^^^^^^^^^^
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/plumpy/processes.py", line 190, in spec
    cls.define(cls._spec)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chemlab/anaconda3/lib/python3.11/site-packages/aiida_aimall/workchains.py", line 440, in define
    cls.update_paramaters_with_cm,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'SmilesToGaussianWorkchain' has no attribute 'update_paramaters_with_cm'