aiida-vibroscopy: PW calculation got interrupted in DielectricWorkChain

Hi everyone,
I am running the IRamanSpectraWorkChain of aiida-vibroscopy. In the DielectricWorkChain step, the pw.x calculation failed with the error below (an end-of-file read error). I am not sure whether I should post this in the QE community instead; I think it is more related to the AiiDA workflow, since my input file runs smoothly in a standalone pw.x calculation. I am not sure what causes this error.

IRamanSpectraWorkChain<5509> Waiting [None]
    └── HarmonicWorkChain<5511> Waiting [1:if_(should_run_parallel)]
        ├── generate_preprocess_data<5512> Finished [0]
        ├── PhononWorkChain<5514> Finished [0] [7:if_(should_run_phonopy)]
        │   ├── generate_preprocess_data<5515> Finished [0]
        │   ├── get_supercell<5517> Finished [0]
        │   ├── PwBaseWorkChain<5521> Finished [0] [3:results]
        │   │   └── PwCalculation<5527> Finished [0]
        │   ├── get_supercells_with_displacements<5537> Finished [0]
        │   ├── PwBaseWorkChain<5551> Finished [0] [3:results]
        │   │   └── PwCalculation<5554> Finished [0]
        │   ├── PwBaseWorkChain<5556> Finished [0] [3:results]
        │   │   └── PwCalculation<5603> Finished [0]
        │   ├── PwBaseWorkChain<5558> Finished [0] [3:results]
        │   │   └── PwCalculation<5561> Finished [0]
        │   ├── PwBaseWorkChain<5563> Finished [0] [3:results]
        │   │   └── PwCalculation<5566> Finished [0]
        │   ├── PwBaseWorkChain<5568> Finished [0] [3:results]
        │   │   └── PwCalculation<5571> Finished [0]
        │   ├── PwBaseWorkChain<5573> Finished [0] [3:results]
        │   │   └── PwCalculation<5576> Finished [0]
        │   ├── PwBaseWorkChain<5578> Finished [0] [3:results]
        │   │   └── PwCalculation<5581> Finished [0]
        │   ├── PwBaseWorkChain<5583> Finished [0] [3:results]
        │   │   └── PwCalculation<5586> Finished [0]
        │   ├── PwBaseWorkChain<5588> Finished [0] [3:results]
        │   │   └── PwCalculation<5591> Finished [0]
        │   ├── PwBaseWorkChain<5593> Finished [0] [3:results]
        │   │   └── PwCalculation<5613> Finished [0]
        │   ├── PwBaseWorkChain<5595> Finished [0] [3:results]
        │   │   └── PwCalculation<5606> Finished [0]
        │   ├── PwBaseWorkChain<5597> Finished [0] [3:results]
        │   │   └── PwCalculation<5600> Finished [0]
        │   └── generate_phonopy_data<5682> Finished [0]
        └── DielectricWorkChain<5518> Waiting [8:while_(should_run_electric_field_scfs)]
            ├── PwBaseWorkChain<5524> Finished [0] [3:results]
            │   └── PwCalculation<5530> Finished [0]
            ├── PwBaseWorkChain<5628> Finished [0] [3:results]
            │   └── PwCalculation<5631> Finished [0]
            ├── PwBaseWorkChain<5690> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5693> Finished [312]
            ├── PwBaseWorkChain<5696> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5699> Finished [312]
            ├── PwBaseWorkChain<5702> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5705> Finished [312]
            ├── PwBaseWorkChain<5708> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5711> Finished [312]
forrtl: severe (24): end-of-file during read, unit 4, file /lustre/scratch5/.mdt0/rkarkee/Runaiida/e3/a3/2edc-d231-40cb-9173-1fb022e66cd8/./out/aiida.save/wfc5.dat
Image              PC                Routine            Line        Source
pw.x               0000000001118795  Unknown               Unknown  Unknown
pw.x               0000000001116C00  Unknown               Unknown  Unknown
pw.x               0000000000B8FFE4  io_base_mp_read_w         357  io_base.f90
pw.x               0000000000573312  pw_restart_new_mp        1511  pw_restart_new.f90
pw.x               0000000000684FB0  wfcinit_                   89  wfcinit.f90
pw.x               00000000004AAC82  init_run_                 189  init_run.f90
pw.x               00000000005B300C  run_pwscf_                160  run_pwscf.f90
pw.x               0000000000411CE2  MAIN__                     85  pwscf.f90
pw.x               0000000000411B3D  Unknown               Unknown  Unknown
libc-2.31.so       0000145B535BB29D  __libc_start_main     Unknown  Unknown
pw.x               0000000000411A6A  Unknown               Unknown  Unknown
srun: error: nid003040: task 0: Exited with exit code 24
srun: Terminating StepId=10655535.0
slurmstepd: error: *** STEP 10655535.0 ON nid003040 CANCELLED AT 2024-04-02T15:53:42 ***

Hi @rkarkee,

Did you manually check the output files of those calculations? They might contain more details, which aren’t parsed, about why the calculation got interrupted.

You can do so by either going to the work directory (in case it hasn’t been cleaned)

verdi calcjob gotocomputer <pk>

or by printing the output in the shell using

verdi calcjob outputcat <pk>

<pk> needs to be replaced by the pk of one of your failed PwCalculations.

The stdout/stderr content shows a Fortran runtime error (end-of-file during read), so pw.x crashed hard. It seems that it crashed while trying to read the wavefunction file /lustre/scratch5/.mdt0/rkarkee/Runaiida/e3/a3/2edc-d231-40cb-9173-1fb022e66cd8/./out/aiida.save/wfc5.dat. You need to check where that file came from. If it came from a restart, check the previous calculation and make sure it finished correctly. Either the restart file was already corrupt (probably because the calculation failed) or it was copied incorrectly.
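
To make that check concrete, here is a minimal sketch that pulls the offending file path out of the scheduler stderr and reports whether the file exists and how large it is. It only assumes the Intel Fortran message format shown in the log above; the function name `find_unreadable_files` is made up for illustration, and the script has to run on the machine where the scratch directory lives for the size to be filled in.

```python
import re
from pathlib import Path

# Pattern for the Intel Fortran runtime message seen in the stderr above.
# Assumption: other compilers or error codes use a different wording.
FORRTL_EOF = re.compile(
    r"forrtl: severe \((?P<code>\d+)\): end-of-file during read, "
    r"unit (?P<unit>-?\d+), file (?P<path>\S+)"
)


def find_unreadable_files(stderr_text):
    """Return (path, size_in_bytes or None) for each file pw.x failed to read.

    The size is None when the file is not present on this machine.
    """
    results = []
    for match in FORRTL_EOF.finditer(stderr_text):
        raw_path = match.group("path")
        path = Path(raw_path)
        size = path.stat().st_size if path.is_file() else None
        results.append((raw_path, size))
    return results
```

Feeding it the text printed by `verdi calcjob outputcat <pk>` (or the scheduler stderr file) gives the path to inspect; a size far below that of the neighbouring wfc files would point to a truncated copy.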

Hi @sphuber

I am not sure where that wfc file came from, but in the DielectricWorkChain I can see only one input file, which is not a restart but uses startingpot and startingwfc from file. I am not sure whether these pot and wfc files come from the previous pw calculations in the PhononWorkChain. Maybe @bastonero (Lorenzo) can comment on this.

&ELECTRONS
  conv_thr =   2.0000000000d-12
  efield_cart(1) = 0
  efield_cart(2) = 0
  efield_cart(3) =   8.0000000000d-05
  efield_phase = 'read'
  electron_maxstep = 200
  mixing_beta =   3.0000000000d-01
  startingpot = 'file'
  startingwfc = 'file'

I also went to the folder /lustre/scratch5/.mdt0/rkarkee/Runaiida/ce/d5/7021-7fd6-48da-9861-cec6fa8ab1b9/./out/aiida.save/

I can see the following files there; the XML file is the only one I can read.

-rw------- 1 rkarkee rkarkee 54539384 Apr  2 13:18 wfc32.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:18 wfc29.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:18 wfc30.dat
-rw------- 1 rkarkee rkarkee 54539384 Apr  2 13:18 wfc31.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr  2 13:18 wfc27.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr  2 13:18 wfc28.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:18 wfc24.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr  2 13:18 wfc25.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr  2 13:18 wfc26.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr  2 13:17 wfc22.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:17 wfc23.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr  2 13:17 wfc19.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr  2 13:17 wfc20.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr  2 13:17 wfc21.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr  2 13:17 wfc16.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr  2 13:17 wfc17.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr  2 13:17 wfc18.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr  2 13:17 wfc14.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr  2 13:17 wfc15.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr  2 13:17 wfc11.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr  2 13:17 wfc12.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr  2 13:17 wfc13.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:17 wfc10.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:17 wfc9.dat
-rw------- 1 rkarkee rkarkee  1507328 Apr  2 13:17 wfc6.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr  2 13:17 wfc7.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr  2 13:17 wfc8.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:17 wfc4.dat
-rw------- 1 rkarkee rkarkee 18874368 Apr  2 13:17 wfc5.dat
-rw------- 1 rkarkee rkarkee 54539384 Apr  2 13:17 wfc1.dat
-rw------- 1 rkarkee rkarkee 54539384 Apr  2 13:17 wfc2.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr  2 13:17 wfc3.dat
-rw------- 1 rkarkee rkarkee  8230744 Apr  2 13:17 charge-density.dat
-rw------- 1 rkarkee rkarkee   178694 Apr  2 13:17 data-file-schema.xml
-rw------- 1 rkarkee rkarkee   327011 Apr  2 13:17 Hf.upf
-rw------- 1 rkarkee rkarkee   222953 Apr  2 13:17 Te.upf
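
A note on the listing above: most wfc files are around 54.5 MB, while wfc5.dat (18874368 bytes) and wfc6.dat (1507328 bytes) are much smaller. Keep in mind that this directory (ce/d5/...) is not the one named in the error message (e3/a3/...). The following is a rough sketch for flagging such outliers from an `ls -l` listing. The function name `flag_truncated_wfc` and the 0.9 threshold are arbitrary choices for illustration, and a small file is only a hint of truncation, not proof, since sizes legitimately vary a little between k-points.

```python
from statistics import median


def flag_truncated_wfc(listing, ratio=0.9):
    """Return {filename: size} for wfc*.dat files smaller than ratio * median size.

    `listing` is the text output of `ls -l` for an aiida.save directory.
    """
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        # `ls -l` layout: mode, links, owner, group, size, month, day, time, name
        if len(fields) < 9:
            continue
        name = fields[8]
        if name.startswith("wfc") and name.endswith(".dat"):
            sizes[name] = int(fields[4])
    if not sizes:
        return {}
    typical = median(sizes.values())
    # Anything well below the typical size is suspicious.
    return {name: size for name, size in sizes.items() if size < ratio * typical}
```

Applied to the listing above, this would single out wfc5.dat and wfc6.dat as the files worth comparing against the copies in the directory of the failed calculation.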

The .xml file has inputs as:

  <input>
    <control_variables>
      <title></title>
      <calculation>scf</calculation>
      <restart_mode>from_scratch</restart_mode>
      <prefix>aiida</prefix>
      <pseudo_dir>./pseudo/</pseudo_dir>
      <outdir>./out/</outdir>
      <stress>false</stress>
      <forces>true</forces>
      <wf_collect>true</wf_collect>
      <disk_io>medium</disk_io>
      <max_seconds>54720</max_seconds>
      <nstep>1</nstep>
      <etot_conv_thr>6.000000000000000E-005</etot_conv_thr>
      <forc_conv_thr>5.000000000000000E-005</forc_conv_thr>
      <press_conv_thr>5.000000000000000E-001</press_conv_thr>
      <verbosity>high</verbosity>
      <print_every>100000</print_every>
      <fcp>false</fcp>
      <rism>false</rism>
    </control_variables>
    <atomic_species ntyp="2">
      <species name="Hf">
        <mass>1.784900000000000E+002</mass>
        <pseudo_file>Hf.upf</pseudo_file>
      </species>
      <species name="Te">
        <mass>1.276000000000000E+002</mass>
        <pseudo_file>Te.upf</pseudo_file>
      </species>
    </atomic_species>
    <atomic_structure nat="12" alat="13.9997101029711">
      <atomic_positions>
        <atom name="Hf" index="1">-3.382609763080129E-007  8.536710778968260E+000  1.898954358172840E+001</atom>
        <atom name="Hf" index="2">-6.786006513531140E-007  1.845104035232113E+001  6.329847396459398E+000</atom>
        <atom name="Te" index="3">-4.541011877475726E-007  1.791185999228045E+001  1.898954358172840E+001</atom>
        <atom name="Te" index="4">-2.271450801800176E-007  9.075892920264780E+000  6.329847396459398E+000</atom>
        <atom name="Te" index="5">-9.070685398203695E-007  2.505429796490285E+001  2.160386980004620E+001</atom>
        <atom name="Te" index="6">-5.763664680108600E-008  1.933453166386534E+000  8.944174981994047E+000</atom>
        <atom name="Te" index="7">-9.070685398203695E-007  2.505429796490285E+001  1.637521554039181E+001</atom>
        <atom name="Te" index="8">-5.763664680108600E-008  1.933453166386534E+000  3.715520266537717E+000</atom>
        <atom name="Te" index="9">-6.812462679275902E-007  2.136630106312823E+001  1.096427137215388E+001</atom>
        <atom name="Te" index="10">-1.678076798667684E-007  5.621450513380632E+000  2.362396527879113E+001</atom>
        <atom name="Te" index="11">-6.812462679275902E-007  2.136630106312823E+001  1.695424230890503E+000</atom>
        <atom name="Te" index="12">-1.678076798667684E-007  5.621450513380632E+000  1.435511914267307E+001</atom>
      </atomic_positions>
      <cell>
        <a1>3.729224562850147E+000  1.349387887625589E+001  0.000000000000000E+000</a1>
        <a2>-3.729225314961145E+000  1.349387866460657E+001  0.000000000000000E+000</a2>
        <a3>0.000000000000000E+000  0.000000000000000E+000  2.531939323149723E+001</a3>
      </cell>
    </atomic_structure>
    <dft>
      <functional>PW</functional>
    </dft>
    <spin>
      <lsda>false</lsda>
      <noncolin>false</noncolin>
      <spinorbit>false</spinorbit>
    </spin>
    <bands>
      <tot_charge>0.000000000000000E+000</tot_charge>
      <occupations>fixed</occupations>
    </bands>
    <basis>
      <gamma_only>false</gamma_only>
      <ecutwfc>4.500000000000000E+001</ecutwfc>
      <ecutrho>1.800000000000000E+002</ecutrho>
    </basis>
    <electron_control>
      <diagonalization>davidson</diagonalization>
      <mixing_mode>plain</mixing_mode>
      <mixing_beta>3.000000000000000E-001</mixing_beta>
      <conv_thr>1.000000000000000E-012</conv_thr>
      <mixing_ndim>8</mixing_ndim>
      <max_nstep>200</max_nstep>
      <exx_nstep>100</exx_nstep>
      <real_space_q>false</real_space_q>
      <real_space_beta>false</real_space_beta>
      <tq_smoothing>false</tq_smoothing>
      <tbeta_smoothing>false</tbeta_smoothing>
      <diago_thr_init>0.000000000000000E+000</diago_thr_init>
      <diago_full_acc>false</diago_full_acc>
      <diago_cg_maxiter>20</diago_cg_maxiter>
      <diago_ppcg_maxiter>20</diago_ppcg_maxiter>
      <diago_rmm_ndim>4</diago_rmm_ndim>
      <diago_gs_nblock>16</diago_gs_nblock>
      <diago_rmm_conv>false</diago_rmm_conv>
    </electron_control>
    <k_points_IBZ>
      <monkhorst_pack nk1="4" nk2="4" nk3="2" k1="1" k2="1" k3="1">Uniform grid with offset</monkhorst_pack>
    </k_points_IBZ>
    <ion_control>
      <ion_dynamics>none</ion_dynamics>
      <upscale>1.000000000000000E+002</upscale>
      <remove_rigid_rot>false</remove_rigid_rot>
      <refold_pos>false</refold_pos>
    </ion_control>
    <cell_control>
      <cell_dynamics>none</cell_dynamics>
      <pressure>0.000000000000000E+000</pressure>
      <wmass>0.000000000000000E+000</wmass>
      <cell_do_free>all</cell_do_free>
    </cell_control>
    <symmetry_flags>
      <nosym>false</nosym>
      <nosym_evc>false</nosym_evc>
      <noinv>false</noinv>
      <no_t_rev>false</no_t_rev>
      <force_symmorphic>false</force_symmorphic>
      <use_all_frac>false</use_all_frac>
    </symmetry_flags>
    <electric_field>
      <electric_potential>homogenous_field</electric_potential>
      <dipole_correction>false</dipole_correction>
      <potential_max_position>5.000000000000000E-001</potential_max_position>
      <potential_decrease_width>1.000000000000000E-001</potential_decrease_width>
      <electric_field_amplitude>0.000000000000000E+000</electric_field_amplitude>
      <electric_field_vector>0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000</electric_field_vector>
      <nk_per_string>0</nk_per_string>
      <n_berry_cycles>1</n_berry_cycles>
    </electric_field>
  </input>
  <output>
    <convergence_info>
      <scf_conv>
        <convergence_achieved>true</convergence_achieved>
        <n_scf_steps>10</n_scf_steps>
        <scf_error>8.100335246331452E-013</scf_error>
      </scf_conv>
    </convergence_info>
    <algorithmic_info>
      <real_space_q>false</real_space_q>
      <real_space_beta>false</real_space_beta>
      <uspp>false</uspp>
      <paw>false</paw>
    </algorithmic_info>
    <atomic_species ntyp="2" pseudo_dir="./pseudo/">
      <species name="Hf">
        <mass>1.784900000000000E+002</mass>
        <pseudo_file>Hf.upf</pseudo_file>
      </species>
      <species name="Te">
        <mass>1.276000000000000E+002</mass>
        <pseudo_file>Te.upf</pseudo_file>
      </species>
    </atomic_species>
    <atomic_structure nat="12" alat="13.9997101029711">
      <atomic_positions>
        <atom name="Hf" index="1">-3.382609763080129E-007  8.536710778968260E+000  1.898954358172840E+001</atom>
        <atom name="Hf" index="2">-6.786006513531140E-007  1.845104035232113E+001  6.329847396459398E+000</atom>
        <atom name="Te" index="3">-4.541011877475726E-007  1.791185999228045E+001  1.898954358172840E+001</atom>
        <atom name="Te" index="4">-2.271450801800176E-007  9.075892920264780E+000  6.329847396459398E+000</atom>
        <atom name="Te" index="5">-9.070685398203695E-007  2.505429796490285E+001  2.160386980004620E+001</atom>
        <atom name="Te" index="6">-5.763664680108600E-008  1.933453166386534E+000  8.944174981994047E+000</atom>
        <atom name="Te" index="7">-9.070685398203695E-007  2.505429796490285E+001  1.637521554039181E+001</atom>
        <atom name="Te" index="8">-5.763664680108600E-008  1.933453166386534E+000  31525235E-001

As Sebastiaan already mentioned, it does indeed seem to be a problem with reading a wfc file. The DielectricWorkChain, as well as the PhononWorkChain, exploits the wfc and charge density (rho) of a previous SCF as the starting point for the calculations with perturbations (i.e. electric fields, atomic displacements). As such, either something went wrong while copying the wfc/rho files, or something was compiled in a sub-optimal way. Restarting from wavefunctions can generally be problematic. I would suggest, for example, compiling QE with HDF5 (which will also speed up IO operations). But for this, I guess it would be better to ask on the QE mailing list, as we can provide only limited feedback here on such QE details.

It's weird that this happened, because it was working fine in previous calculations.

How do I clean up the information stored (apart from the working directory)?

Also,

I just compiled QE with HDF5.

Is there anything I need to change in overrides.yaml or in AiiDA itself to work with the HDF5 build, or does everything stay the same?

Best
Rijan

Ok, if everything was running smoothly before, then it’s likely to be a problem with the cluster or related to some particular conditions (maybe you are running out of storage?). If you want to clean up the working directories, you can use the verdi command “verdi calcjob cleanworkdir OPTIONS” (to see the OPTIONS, append --help). Otherwise, if nothing else is running, you can also manually delete everything in the AiiDA folder on the cluster (I guess in your case /lustre/scratch5/.mdt0/rkarkee/Runaiida).
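
For the storage question, a quick sketch of a free-space check using only the Python standard library is given below. The function name `free_space_report` and the 10 GiB threshold are placeholders, not recommendations. Note also that on a Lustre filesystem a per-user quota can be exhausted while the filesystem as a whole still has room, and this call does not see quotas, so treat the result as a first hint only.

```python
import shutil


def free_space_report(path, min_free_gib=10.0):
    """Report the free space of the filesystem that contains `path`.

    Only filesystem-wide numbers are visible here; user quotas are not.
    """
    usage = shutil.disk_usage(path)
    free_gib = usage.free / 2**30
    return {
        "path": path,
        "free_gib": round(free_gib, 2),
        "enough": free_gib >= min_free_gib,
    }
```

It would be called on the scratch root used by AiiDA, for example `free_space_report("/lustre/scratch5/.mdt0/rkarkee/Runaiida")`, from a login or compute node that mounts it.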

Moreover, since the PhononWorkChain ran smoothly, I suggest you read about the caching mechanism of AiiDA. It will save you lots of resources: it is basically a restart mechanism, so that calculations that ended correctly will not be redone. Some care must be taken, though, for workflows like these where calculations restart from previous ones (that is, be careful not to remove ”critical” folders). Since the PhononWorkChain and the DielectricWorkChain are independent, this mechanism is usually rather powerful and convenient. It is also very easy to activate (see the aiida-core docs). It just requires a bit of practice to understand the logic (you will also notice it is rather sensitive to any change you make in the inputs).

For HDF5, nothing changes at the AiiDA level, fortunately.