AiiDA not getting number of jobs from scheduler

Hello everyone,

I recently configured AiiDA using Conda to run on a cluster using the ‘core.local’ transport and the ‘core.pbspro’ scheduler. When I run “verdi computer test mycomputer”, all 6 tests pass normally.

But when I launch a workchain, AiiDA cannot see the status of my calculation and shows: ‘Pausing after failed transport task: update_calculation failed 5 times consecutively’

If I run ‘verdi computer test mycomputer’ again after this, one of the tests fails. Here is the error message:

"Getting number of jobs from scheduler... Error: There are lines without equals sign! [' case \\"$-\\" in ', ' *v*x*)', ' ;; *v*)', ' ;; *x*)', ' ;; esac; fi; if [ -n \\"${__lmod_sh_dbg:-}\\" ]; then'] [Failed]: SchedulerParsingError: There are lines without equals sign."

Have you ever encountered this error?
Thanks in advance!
Kind regards,
Thiago

It seems the call to your PBS scheduler to get the jobs is not returning the response that is expected. Can you run qstat -f manually in a terminal and report the output? That should give the status of all running jobs. Then run qstat -f 111 122, where the integers are the ids of running jobs. These are the commands that AiiDA should be running, and apparently they return some bash code instead of data on the job status, which is why AiiDA cannot determine the status of the jobs it is running.

Hi Sebastiaan, thanks for the quick reply.

Here is the output of qstat -f (just showing for one of my jobs):

Job Id: 26571.ada
    Job_Name = aiida-1100
    job_state = Q
    queue = par128
    server = ada
    Checkpoint = u
    ctime = Fri Nov 24 10:01:54 2023
    Error_Path = lovelace:/work/dorinitt/.aiida/f6/96/a6c0-6689-4234-a459-cf63d
	9933a5c/_scheduler-stderr.txt
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Fri Nov 24 10:01:55 2023
    Output_Path = lovelace:/work/dorinitt/.aiida/f6/96/a6c0-6689-4234-a459-cf63
	d9933a5c/_scheduler-stdout.txt
    Priority = 0
    qtime = Fri Nov 24 10:01:54 2023
    Rerunable = False
    Resource_List.cput = 21504:00:00
    Resource_List.mpiprocs = 128
    Resource_List.ncpus = 128
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.Qlist = par128
    Resource_List.select = 1:mpiprocs=128:ncpus=128
    Resource_List.walltime = 168:00:00
    substate = 10
    Variable_List = PBS_O_HOME=/home/lovelace/proj/proj962/dorinitt,
	PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=dorinitt,
	PBS_O_PATH=/home/lovelace/proj/proj962/dorinitt/programs/miniforge3/en
	vs/aiida/bin:/home/lovelace/proj/proj962/dorinitt/programs/miniforge3/e
	nvs/aiida/bin:/home/lovelace/proj/proj962/dorinitt/programs/miniforge3/
	envs/aiida/bin:/home/lovelace/proj/proj962/dorinitt/programs/miniforge3
	/condabin:/home/lovelace/proj/proj962/dorinitt/.local/bin:/home/lovelac
	e/proj/proj962/dorinitt/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/us
	r/sbin:/opt/pbs/bin,PBS_O_MAIL=/var/spool/mail/dorinitt,
	AIIDA_PATH=/home/lovelace/proj/proj962/dorinitt/.aiida,
	MODULESHOME=/opt/ohpc/admin/lmod/lmod,CONDA_DEFAULT_ENV=aiida,
	LMOD_SETTARG_FULL_SUPPORT=no,HISTSIZE=1000,
	LMOD_PKG=/opt/ohpc/admin/lmod/lmod,
	XML_CATALOG_FILES=file:///home/lovelace/proj/proj962/dorinitt/programs
	/miniforge3/envs/aiida/etc/xml/catalog file:///etc/xml/catalog,
	LMOD_CMD=/opt/ohpc/admin/lmod/lmod/libexec/lmod,
	LESSOPEN=||/usr/bin/lesspipe.sh %s,OMP_NUM_THREADS=1,
	LMOD_FULL_SETTARG_SUPPORT=no,
	LMOD_DIR=/opt/ohpc/admin/lmod/lmod/libexec,LC_TIME=pt_BR.UTF-8,
	BASH_FUNC_which%%=() {  ( alias; eval ${which_declare} ) | /usr/bin/wh
	ich --tty-only --read-alias --read-functions --show-tilde --show-dot $@
	
},
	BASH_FUNC_module%%=() {  if [ -z \"${LMOD_SH_DBG_ON+x}\" ]; then
 case
	 \"$-\" in 
 *v*x*)
 __lmod_sh_dbg=\'vx\'
 ;; *v*)
 __lmod_sh_dbg=\'v\'
	
 ;; *x*)
 __lmod_sh_dbg=\'x\'
 ;; esac; fi; if [ -n \"${__lmod_sh_dbg:
	-}\" ]; then
 set +$__lmod_sh_dbg; echo \"Shell debugging temporarily s
	ilenced: export LMOD_SH_DBG_ON=1 for Lmod\'s output\" 1>&2; fi; eval \"
	$($LMOD_CMD $LMOD_SHELL_PRGM \"$@\")\" && eval \"$(${LMOD_SETTARG_CMD:-
	:} -s sh)\"; __lmod_my_status=$?; if [ -n \"${__lmod_sh_dbg:-}\" ]; the
	n
 echo \"Shell debugging restarted\" 1>&2; set -$__lmod_sh_dbg; fi; un
	set __lmod_sh_dbg; return $__lmod_my_status
},
	BASH_FUNC_ml%%=() {  eval \"$($LMOD_DIR/ml_cmd \"$@\")\"
},
	_=/opt/pbs/bin/qsub,PBS_O_QUEUE=par128,PBS_O_HOST=lovelace
    comment = Not Running: Insufficient amount of resource: Qlist 
    etime = Fri Nov 24 10:01:54 2023
    eligible_time = 00:00:20
    Submit_arguments = _aiidasubmit.sh
    project = _pbs_project_default
    Submit_Host = lovelace

With this output I saw that the error comes from “BASH_FUNC_module”. I don’t quite understand why it is written there (weirdly, only for my jobs). I have AiiDA running normally on another computer that also uses PBS, and this line does not appear in the qstat -f output for my jobs there. Do you think it could be related to VASP being compiled with the environment modules of the cluster?

That is the problem. AiiDA doesn’t expect this output at the end starting with the line

	BASH_FUNC_module%%=() {  if [ -z \"${LMOD_SH_DBG_ON+x}\" ]; then

Is this scheduler running in an HPC environment? If so, you should probably ask your administrator.

I think verdi computer test is passing without problems because it just runs qstat -f, i.e., without any specific job ids. Could you check that the output of that command doesn’t include the spurious shell code at the end?
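To check a saved qstat -f output for this kind of stray content, a small Python sketch along these lines might help. It is only a rough approximation of the check that raises the error, not AiiDA's actual parser: it assumes that continuation lines start with a tab and get joined to the previous line, and the function name lines_without_equals is made up for illustration.

```python
def lines_without_equals(stdout: str) -> list[str]:
    """Return the logical lines of `qstat -f` output that contain no '='.

    Assumption (not taken from AiiDA): a line starting with a tab continues
    the previous line, and "Job Id:" headers and blank lines are ignored.
    """
    logical = []
    for line in stdout.split('\n'):
        if line.startswith('\t') and logical:
            # Continuation line: glue it onto the previous logical line.
            logical[-1] += line[1:]
        elif line.strip() and not line.startswith('Job Id:'):
            logical.append(line)
    return [line for line in logical if '=' not in line]
```

Feeding it the output of qstat -f, once with and once without job ids, should show whether the shell code turns up in both cases.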

It is running in an HPC environment, so I’ll ask the administrator.

I noticed that there are other jobs (not all of them) from other users that also show this spurious BASH_FUNC_module, so every time there is a job with this line, AiiDA will fail to get the job status from the scheduler. This may come from the way the scheduler was installed on the cluster. I will ask support to see if they can solve this.

If they cannot fix it, an alternative solution is to subclass the PbsproScheduler plugin that comes with aiida-core and override the parsing function to deal with the spurious output. If you know what it consistently looks like, you can simply ignore it and still parse the actual data from the response, since it is there. You then register this subclass with an entry point, e.g. aiida.schedulers:custom.pbspro, configure your computer to use scheduler type custom.pbspro, and things should work.
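As a minimal sketch of that idea, something like the following could work. It rests on several assumptions: that the method to override is _parse_joblist_output(retval, stdout, stderr), that the stray lines only ever appear inside the Variable_List attribute, and that the parser does not need Variable_List at all. The names strip_variable_list, CustomPbsproScheduler and my_scheduler are hypothetical.

```python
import re

# Top-level attribute lines in `qstat -f` output: four spaces, a key, " = ".
_ATTRIBUTE = re.compile(r'^ {4}[A-Za-z_][\w.]* = ')
_JOB_HEADER = re.compile(r'^Job Id:')


def strip_variable_list(stdout: str) -> str:
    """Return `qstat -f` output with every Variable_List block removed."""
    kept = []
    skipping = False
    for line in stdout.split('\n'):
        if _JOB_HEADER.match(line) or _ATTRIBUTE.match(line):
            # A new job or attribute ends any Variable_List block in progress.
            skipping = line.startswith('    Variable_List = ')
        if not skipping:
            kept.append(line)
    return '\n'.join(kept)


try:
    from aiida.schedulers.plugins.pbspro import PbsproScheduler
except ImportError:
    # aiida-core is not installed: only the filter above is usable.
    PbsproScheduler = None

if PbsproScheduler is not None:

    class CustomPbsproScheduler(PbsproScheduler):
        """PBS Pro scheduler that drops Variable_List before parsing."""

        def _parse_joblist_output(self, retval, stdout, stderr):
            # Assumption: this is the method that parses `qstat -f` output.
            cleaned = strip_variable_list(stdout)
            return super()._parse_joblist_output(retval, cleaned, stderr)

# Hypothetical entry point registration in the package's pyproject.toml:
#   [project.entry-points."aiida.schedulers"]
#   "custom.pbspro" = "my_scheduler:CustomPbsproScheduler"
```

Filtering the text before handing it to the parent class keeps the override small, so the rest of the parsing stays whatever aiida-core ships.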

I see, I think this might be an easier solution. I saw that this error message is implemented in the aiida.schedulers.plugins.pbsbaseclasses:

# There are lines without equals sign: this is bad
if lines_without_equals_sign:
    # Should I only warn?
    _LOGGER.error(f'There are lines without equals sign! {lines_without_equals_sign}')
    raise SchedulerParsingError('There are lines without equals sign.')

Maybe if I simply comment out this if statement it might work? If it doesn’t, I’ll look at the code in more detail to try to correct it.

Sorry for the naive question, but I am not familiar with how to register a custom subclass like aiida.schedulers:custom.pbspro in AiiDA. Is there a tutorial on how to do it?

There is information in the documentation: How to package plugins — AiiDA 2.4.0.post0 documentation
Note that it provides an example for a complete plugin package with calcjob plugins, parsers etc. You would need none of that, just the Scheduler plugin, which is just the subclass of PbsproScheduler.

Ok! Thanks a lot Sebastiaan! :)

It worked!
I did not need to create a custom scheduler plugin for the moment. What I did was simply comment out the ‘if’ statement that I showed you in pbsbaseclasses.py in my conda environment: …/miniforge3/envs/aiida/lib/python3.11/site-packages/aiida/schedulers/plugins/

Now when I launched a workchain it shows “Monitoring scheduler: job state QUEUED”, which shows that it worked, even though that weird BASH_FUNC_module variable still appears in ‘qstat -f’.

Thanks a lot for the help :)

Great. Just remember that if you update AiiDA and it installs a new version, those changes may get undone, so you will have to reapply them. But for now it is a good solution if it helps you start running things.

I’ll keep it in mind, thanks!
