Hi everyone,
I am running the IRamanSpectraWorkChain of aiida-vibroscopy. In the DielectricWorkChain step, the pw.x calculations failed with the error shown below (an end-of-file read error). I was not sure whether to post this in the QE community, but I think it is more likely related to the AiiDA workflow, because my input file runs smoothly as a standalone pw.x calculation. I do not know the cause of this error.
IRamanSpectraWorkChain<5509> Waiting [None]
    └── HarmonicWorkChain<5511> Waiting [1:if_(should_run_parallel)]
        ├── generate_preprocess_data<5512> Finished [0]
        ├── PhononWorkChain<5514> Finished [0] [7:if_(should_run_phonopy)]
        │   ├── generate_preprocess_data<5515> Finished [0]
        │   ├── get_supercell<5517> Finished [0]
        │   ├── PwBaseWorkChain<5521> Finished [0] [3:results]
        │   │   └── PwCalculation<5527> Finished [0]
        │   ├── get_supercells_with_displacements<5537> Finished [0]
        │   ├── PwBaseWorkChain<5551> Finished [0] [3:results]
        │   │   └── PwCalculation<5554> Finished [0]
        │   ├── PwBaseWorkChain<5556> Finished [0] [3:results]
        │   │   └── PwCalculation<5603> Finished [0]
        │   ├── PwBaseWorkChain<5558> Finished [0] [3:results]
        │   │   └── PwCalculation<5561> Finished [0]
        │   ├── PwBaseWorkChain<5563> Finished [0] [3:results]
        │   │   └── PwCalculation<5566> Finished [0]
        │   ├── PwBaseWorkChain<5568> Finished [0] [3:results]
        │   │   └── PwCalculation<5571> Finished [0]
        │   ├── PwBaseWorkChain<5573> Finished [0] [3:results]
        │   │   └── PwCalculation<5576> Finished [0]
        │   ├── PwBaseWorkChain<5578> Finished [0] [3:results]
        │   │   └── PwCalculation<5581> Finished [0]
        │   ├── PwBaseWorkChain<5583> Finished [0] [3:results]
        │   │   └── PwCalculation<5586> Finished [0]
        │   ├── PwBaseWorkChain<5588> Finished [0] [3:results]
        │   │   └── PwCalculation<5591> Finished [0]
        │   ├── PwBaseWorkChain<5593> Finished [0] [3:results]
        │   │   └── PwCalculation<5613> Finished [0]
        │   ├── PwBaseWorkChain<5595> Finished [0] [3:results]
        │   │   └── PwCalculation<5606> Finished [0]
        │   ├── PwBaseWorkChain<5597> Finished [0] [3:results]
        │   │   └── PwCalculation<5600> Finished [0]
        │   └── generate_phonopy_data<5682> Finished [0]
        └── DielectricWorkChain<5518> Waiting [8:while_(should_run_electric_field_scfs)]
            ├── PwBaseWorkChain<5524> Finished [0] [3:results]
            │   └── PwCalculation<5530> Finished [0]
            ├── PwBaseWorkChain<5628> Finished [0] [3:results]
            │   └── PwCalculation<5631> Finished [0]
            ├── PwBaseWorkChain<5690> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5693> Finished [312]
            ├── PwBaseWorkChain<5696> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5699> Finished [312]
            ├── PwBaseWorkChain<5702> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5705> Finished [312]
            ├── PwBaseWorkChain<5708> Finished [300] [2:while_(should_run_process)(2:inspect_process)]
            │   └── PwCalculation<5711> Finished [312]
forrtl: severe (24): end-of-file during read, unit 4, file /lustre/scratch5/.mdt0/rkarkee/Runaiida/e3/a3/2edc-d231-40cb-9173-1fb022e66cd8/./out/aiida.save/wfc5.dat
Image PC Routine Line Source
pw.x 0000000001118795 Unknown Unknown Unknown
pw.x 0000000001116C00 Unknown Unknown Unknown
pw.x 0000000000B8FFE4 io_base_mp_read_w 357 io_base.f90
pw.x 0000000000573312 pw_restart_new_mp 1511 pw_restart_new.f90
pw.x 0000000000684FB0 wfcinit_ 89 wfcinit.f90
pw.x 00000000004AAC82 init_run_ 189 init_run.f90
pw.x 00000000005B300C run_pwscf_ 160 run_pwscf.f90
pw.x 0000000000411CE2 MAIN__ 85 pwscf.f90
pw.x 0000000000411B3D Unknown Unknown Unknown
libc-2.31.so 0000145B535BB29D __libc_start_main Unknown Unknown
pw.x 0000000000411A6A Unknown Unknown Unknown
srun: error: nid003040: task 0: Exited with exit code 24
srun: Terminating StepId=10655535.0
slurmstepd: error: *** STEP 10655535.0 ON nid003040 CANCELLED AT 2024-04-02T15:53:42 ***
Hi @rkarkee,
did you manually check the output files of those calculations? They might contain more details which aren’t parsed, about why the calculation got interrupted.
You can do so by either going to the work directory (in case it hasn’t been cleaned)
verdi calcjob gotocomputer <pk>
or by printing the output in the shell using
verdi calcjob outputcat <pk>
<pk> needs to be replaced by the pk of one of your failed PwCalculations.
The stdout/stderr content shows a Fortran runtime error (end-of-file during read), so pw.x crashed hard. It crashed while trying to read the wavefunction file /lustre/scratch5/.mdt0/rkarkee/Runaiida/e3/a3/2edc-d231-40cb-9173-1fb022e66cd8/./out/aiida.save/wfc5.dat. You need to check where that file came from. If it came from a restart, check the previous calculation and make sure it finished correctly. Either the restart file was already corrupt (probably because the calculation that wrote it failed) or it was copied incorrectly.
Hi @sphuber,
I am not sure where those wfc files came from. In the DielectricWorkChain I can see only one input file; it is not a restart, but it reads startingpot and startingwfc from file. I am not sure whether these potential and wavefunction files come from the previous pw.x calculations of the PhononWorkChain. Maybe @bastonero (Lorenzo) can comment on this.
&ELECTRONS
conv_thr = 2.0000000000d-12
efield_cart(1) = 0
efield_cart(2) = 0
efield_cart(3) = 8.0000000000d-05
efield_phase = 'read'
electron_maxstep = 200
mixing_beta = 3.0000000000d-01
startingpot = 'file'
startingwfc = 'file'
I also went to the folder /lustre/scratch5/.mdt0/rkarkee/Runaiida/ce/d5/7021-7fd6-48da-9861-cec6fa8ab1b9/./out/aiida.save/
There I can see the following files; the only one I can read is the XML file.
-rw------- 1 rkarkee rkarkee 54539384 Apr 2 13:18 wfc32.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:18 wfc29.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:18 wfc30.dat
-rw------- 1 rkarkee rkarkee 54539384 Apr 2 13:18 wfc31.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr 2 13:18 wfc27.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr 2 13:18 wfc28.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:18 wfc24.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr 2 13:18 wfc25.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr 2 13:18 wfc26.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr 2 13:17 wfc22.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:17 wfc23.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr 2 13:17 wfc19.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr 2 13:17 wfc20.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr 2 13:17 wfc21.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr 2 13:17 wfc16.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr 2 13:17 wfc17.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr 2 13:17 wfc18.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr 2 13:17 wfc14.dat
-rw------- 1 rkarkee rkarkee 54536416 Apr 2 13:17 wfc15.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr 2 13:17 wfc11.dat
-rw------- 1 rkarkee rkarkee 54523060 Apr 2 13:17 wfc12.dat
-rw------- 1 rkarkee rkarkee 54508220 Apr 2 13:17 wfc13.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:17 wfc10.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:17 wfc9.dat
-rw------- 1 rkarkee rkarkee 1507328 Apr 2 13:17 wfc6.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr 2 13:17 wfc7.dat
-rw------- 1 rkarkee rkarkee 54459248 Apr 2 13:17 wfc8.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:17 wfc4.dat
-rw------- 1 rkarkee rkarkee 18874368 Apr 2 13:17 wfc5.dat
-rw------- 1 rkarkee rkarkee 54539384 Apr 2 13:17 wfc1.dat
-rw------- 1 rkarkee rkarkee 54539384 Apr 2 13:17 wfc2.dat
-rw------- 1 rkarkee rkarkee 54493380 Apr 2 13:17 wfc3.dat
-rw------- 1 rkarkee rkarkee 8230744 Apr 2 13:17 charge-density.dat
-rw------- 1 rkarkee rkarkee 178694 Apr 2 13:17 data-file-schema.xml
-rw------- 1 rkarkee rkarkee 327011 Apr 2 13:17 Hf.upf
-rw------- 1 rkarkee rkarkee 222953 Apr 2 13:17 Te.upf
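In the listing above, wfc5.dat (18874368 bytes) and wfc6.dat (1507328 bytes) are far smaller than the other wavefunction files (about 54.5 MB each), and both sizes are exact multiples of 64 KiB. That would be consistent with an interrupted write, although it does not prove it. A minimal sketch for flagging such outliers automatically is given below; the function names and the 0.5 tolerance are illustrative assumptions, not part of AiiDA or QE.

```python
from pathlib import Path
from statistics import median


def find_undersized(sizes, tolerance=0.5):
    """Return the names whose size is below tolerance * median size.

    `sizes` maps file name -> size in bytes. The tolerance is an
    arbitrary illustrative threshold, not a QE rule.
    """
    if not sizes:
        return []
    typical = median(sizes.values())
    return sorted(name for name, size in sizes.items() if size < tolerance * typical)


def wfc_sizes(save_dir):
    """Map each wfc*.dat file name in save_dir to its size in bytes."""
    return {p.name: p.stat().st_size for p in Path(save_dir).glob("wfc*.dat")}


# Sizes copied from the listing above (a subset, for illustration).
listing = {
    "wfc3.dat": 54493380,
    "wfc4.dat": 54493380,
    "wfc5.dat": 18874368,
    "wfc6.dat": 1507328,
    "wfc7.dat": 54459248,
    "wfc8.dat": 54459248,
}
print(find_undersized(listing))
```

On the cluster, `find_undersized(wfc_sizes("out/aiida.save"))` would run the same check on a real directory. File sizes legitimately differ a little between k-points, which is why the check compares against the median rather than requiring equal sizes.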
The XML file contains the following inputs:
<input>
<control_variables>
<title></title>
<calculation>scf</calculation>
<restart_mode>from_scratch</restart_mode>
<prefix>aiida</prefix>
<pseudo_dir>./pseudo/</pseudo_dir>
<outdir>./out/</outdir>
<stress>false</stress>
<forces>true</forces>
<wf_collect>true</wf_collect>
<disk_io>medium</disk_io>
<max_seconds>54720</max_seconds>
<nstep>1</nstep>
<etot_conv_thr>6.000000000000000E-005</etot_conv_thr>
<forc_conv_thr>5.000000000000000E-005</forc_conv_thr>
<press_conv_thr>5.000000000000000E-001</press_conv_thr>
<verbosity>high</verbosity>
<print_every>100000</print_every>
<fcp>false</fcp>
<rism>false</rism>
</control_variables>
<atomic_species ntyp="2">
<species name="Hf">
<mass>1.784900000000000E+002</mass>
<pseudo_file>Hf.upf</pseudo_file>
</species>
<species name="Te">
<mass>1.276000000000000E+002</mass>
<pseudo_file>Te.upf</pseudo_file>
</species>
</atomic_species>
<atomic_structure nat="12" alat="13.9997101029711">
<atomic_positions>
<atom name="Hf" index="1">-3.382609763080129E-007 8.536710778968260E+000 1.898954358172840E+001</atom>
<atom name="Hf" index="2">-6.786006513531140E-007 1.845104035232113E+001 6.329847396459398E+000</atom>
<atom name="Te" index="3">-4.541011877475726E-007 1.791185999228045E+001 1.898954358172840E+001</atom>
<atom name="Te" index="4">-2.271450801800176E-007 9.075892920264780E+000 6.329847396459398E+000</atom>
<atom name="Te" index="5">-9.070685398203695E-007 2.505429796490285E+001 2.160386980004620E+001</atom>
<atom name="Te" index="6">-5.763664680108600E-008 1.933453166386534E+000 8.944174981994047E+000</atom>
<atom name="Te" index="7">-9.070685398203695E-007 2.505429796490285E+001 1.637521554039181E+001</atom>
<atom name="Te" index="8">-5.763664680108600E-008 1.933453166386534E+000 3.715520266537717E+000</atom>
<atom name="Te" index="9">-6.812462679275902E-007 2.136630106312823E+001 1.096427137215388E+001</atom>
<atom name="Te" index="10">-1.678076798667684E-007 5.621450513380632E+000 2.362396527879113E+001</atom>
<atom name="Te" index="11">-6.812462679275902E-007 2.136630106312823E+001 1.695424230890503E+000</atom>
<atom name="Te" index="12">-1.678076798667684E-007 5.621450513380632E+000 1.435511914267307E+001</atom>
</atomic_positions>
<cell>
<a1>3.729224562850147E+000 1.349387887625589E+001 0.000000000000000E+000</a1>
<a2>-3.729225314961145E+000 1.349387866460657E+001 0.000000000000000E+000</a2>
<a3>0.000000000000000E+000 0.000000000000000E+000 2.531939323149723E+001</a3>
</cell>
</atomic_structure>
<dft>
<functional>PW</functional>
</dft>
<spin>
<lsda>false</lsda>
<noncolin>false</noncolin>
<spinorbit>false</spinorbit>
</spin>
<bands>
<tot_charge>0.000000000000000E+000</tot_charge>
<occupations>fixed</occupations>
</bands>
<basis>
<gamma_only>false</gamma_only>
<ecutwfc>4.500000000000000E+001</ecutwfc>
<ecutrho>1.800000000000000E+002</ecutrho>
</basis>
<electron_control>
<diagonalization>davidson</diagonalization>
<mixing_mode>plain</mixing_mode>
<mixing_beta>3.000000000000000E-001</mixing_beta>
<conv_thr>1.000000000000000E-012</conv_thr>
<mixing_ndim>8</mixing_ndim>
<max_nstep>200</max_nstep>
<exx_nstep>100</exx_nstep>
<real_space_q>false</real_space_q>
<real_space_beta>false</real_space_beta>
<tq_smoothing>false</tq_smoothing>
<tbeta_smoothing>false</tbeta_smoothing>
<diago_thr_init>0.000000000000000E+000</diago_thr_init>
<diago_full_acc>false</diago_full_acc>
<diago_cg_maxiter>20</diago_cg_maxiter>
<diago_ppcg_maxiter>20</diago_ppcg_maxiter>
<diago_rmm_ndim>4</diago_rmm_ndim>
<diago_gs_nblock>16</diago_gs_nblock>
<diago_rmm_conv>false</diago_rmm_conv>
</electron_control>
<k_points_IBZ>
<monkhorst_pack nk1="4" nk2="4" nk3="2" k1="1" k2="1" k3="1">Uniform grid with offset</monkhorst_pack>
</k_points_IBZ>
<ion_control>
<ion_dynamics>none</ion_dynamics>
<upscale>1.000000000000000E+002</upscale>
<remove_rigid_rot>false</remove_rigid_rot>
<refold_pos>false</refold_pos>
</ion_control>
<cell_control>
<cell_dynamics>none</cell_dynamics>
<pressure>0.000000000000000E+000</pressure>
<wmass>0.000000000000000E+000</wmass>
<cell_do_free>all</cell_do_free>
</cell_control>
<symmetry_flags>
<nosym>false</nosym>
<nosym_evc>false</nosym_evc>
<noinv>false</noinv>
<no_t_rev>false</no_t_rev>
<force_symmorphic>false</force_symmorphic>
<use_all_frac>false</use_all_frac>
</symmetry_flags>
<electric_field>
<electric_potential>homogenous_field</electric_potential>
<dipole_correction>false</dipole_correction>
<potential_max_position>5.000000000000000E-001</potential_max_position>
<potential_decrease_width>1.000000000000000E-001</potential_decrease_width>
<electric_field_amplitude>0.000000000000000E+000</electric_field_amplitude>
<electric_field_vector>0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000</electric_field_vector>
<nk_per_string>0</nk_per_string>
<n_berry_cycles>1</n_berry_cycles>
</electric_field>
</input>
<output>
<convergence_info>
<scf_conv>
<convergence_achieved>true</convergence_achieved>
<n_scf_steps>10</n_scf_steps>
<scf_error>8.100335246331452E-013</scf_error>
</scf_conv>
</convergence_info>
<algorithmic_info>
<real_space_q>false</real_space_q>
<real_space_beta>false</real_space_beta>
<uspp>false</uspp>
<paw>false</paw>
</algorithmic_info>
<atomic_species ntyp="2" pseudo_dir="./pseudo/">
<species name="Hf">
<mass>1.784900000000000E+002</mass>
<pseudo_file>Hf.upf</pseudo_file>
</species>
<species name="Te">
<mass>1.276000000000000E+002</mass>
<pseudo_file>Te.upf</pseudo_file>
</species>
</atomic_species>
<atomic_structure nat="12" alat="13.9997101029711">
<atomic_positions>
<atom name="Hf" index="1">-3.382609763080129E-007 8.536710778968260E+000 1.898954358172840E+001</atom>
<atom name="Hf" index="2">-6.786006513531140E-007 1.845104035232113E+001 6.329847396459398E+000</atom>
<atom name="Te" index="3">-4.541011877475726E-007 1.791185999228045E+001 1.898954358172840E+001</atom>
<atom name="Te" index="4">-2.271450801800176E-007 9.075892920264780E+000 6.329847396459398E+000</atom>
<atom name="Te" index="5">-9.070685398203695E-007 2.505429796490285E+001 2.160386980004620E+001</atom>
<atom name="Te" index="6">-5.763664680108600E-008 1.933453166386534E+000 8.944174981994047E+000</atom>
<atom name="Te" index="7">-9.070685398203695E-007 2.505429796490285E+001 1.637521554039181E+001</atom>
<atom name="Te" index="8">-5.763664680108600E-008 1.933453166386534E+000 31525235E-001
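Rather than reading the XML by eye, the fields that matter here (restart_mode in the input section and convergence_achieved in the output section) can be pulled out with a short script. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and the sample document is a trimmed stand-in that assumes the same element layout as the excerpt above. Note that a truncated data-file-schema.xml would raise xml.etree.ElementTree.ParseError, which is itself a useful hint that the file was not written completely.

```python
import xml.etree.ElementTree as ET


def summarize_qe_xml(xml_text):
    """Extract restart mode and SCF convergence info from a QE XML string."""
    root = ET.fromstring(xml_text)
    scf = "output/convergence_info/scf_conv/"
    return {
        "restart_mode": root.findtext("input/control_variables/restart_mode"),
        "scf_converged": root.findtext(scf + "convergence_achieved"),
        "n_scf_steps": root.findtext(scf + "n_scf_steps"),
    }


# Trimmed stand-in for data-file-schema.xml (values taken from the excerpt).
sample = """<qes:espresso xmlns:qes="http://www.quantum-espresso.org/ns/qes/qes-1.0">
  <input>
    <control_variables>
      <calculation>scf</calculation>
      <restart_mode>from_scratch</restart_mode>
    </control_variables>
  </input>
  <output>
    <convergence_info>
      <scf_conv>
        <convergence_achieved>true</convergence_achieved>
        <n_scf_steps>10</n_scf_steps>
      </scf_conv>
    </convergence_info>
  </output>
</qes:espresso>"""

print(summarize_qe_xml(sample))
```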
As Sebastiaan already mentioned, it does seem to be a problem with reading a wfc file. The DielectricWorkChain, like the PhononWorkChain, exploits the wavefunctions (wfc) and charge density (rho) of a previous SCF as the starting point for the calculations with perturbations (i.e. electric fields, atomic displacements). So either something went wrong while copying the wfc/rho files, or QE was compiled in a suboptimal way. Restarting from wavefunctions can often be problematic. I would suggest, for example, compiling QE with HDF5 (which will also speed up I/O operations). For this, though, I guess it would be better to ask on the QE mailing list, as we can provide only limited feedback here on such QE details.
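One way to test the "something went wrong while copying" hypothesis is to compare the restart files in the parent SCF directory with the copies in the failing calculation's directory. The sketch below is a minimal, standard-library-only illustration; the function name and the directory arguments are hypothetical placeholders to be replaced with the real out/aiida.save folders.

```python
from pathlib import Path


def compare_save_dirs(parent_dir, copy_dir, pattern="*.dat"):
    """Return {name: (parent_size, copy_size)} for files that differ.

    copy_size is None when the file is missing from copy_dir.
    """
    mismatches = {}
    copy_dir = Path(copy_dir)
    for src in sorted(Path(parent_dir).glob(pattern)):
        dst = copy_dir / src.name
        src_size = src.stat().st_size
        dst_size = dst.stat().st_size if dst.exists() else None
        if src_size != dst_size:
            mismatches[src.name] = (src_size, dst_size)
    return mismatches


# Hypothetical usage on the cluster (paths are placeholders):
# print(compare_save_dirs("<parent workdir>/out/aiida.save",
#                         "<failed workdir>/out/aiida.save"))
```

An empty result only shows that the copy matches the parent in size. If the parent files were already truncated when the SCF wrote them, the copy will match and still be unreadable, so the parent directory needs to be checked as well.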
It's weird that this happened, because it was working fine in previous calculations.
How do I clean up the stored information (apart from the working directory)?
Also, I just compiled QE with HDF5.
Is there anything I need to change in overrides.yaml or in AiiDA itself to work with the HDF5 build, or does everything stay the same?
Best
Rijan
OK, if everything was running smoothly before, then it's likely a problem with the cluster or with some particular conditions (maybe you are running out of storage?). If you want to clean up the working directories, you can use the verdi command "verdi calcjob cleanworkdir OPTIONS" (to see the available options, run it with --help). Otherwise, if nothing else is running, you can also manually delete everything in the AiiDA folder on the cluster (I guess in your case /lustre/scratch5/.mdt0/rkarkee/Runaiida).
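As a quick first check of the storage hypothesis, the free space on the filesystem holding the scratch folder can be queried from Python. This is only a rough sketch with hypothetical helper names: it reports filesystem-level numbers, so it will not reveal a per-user quota, which on Lustre is typically inspected with the site's own tool (for example lfs quota).

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def format_usage(path):
    """Return a one-line, human-readable summary of free vs total space."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return f"{path}: {usage.free / gib:.1f} GiB free of {usage.total / gib:.1f} GiB"


# "." is a placeholder; on the cluster, pass the scratch directory instead.
print(format_usage("."))
```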
Moreover, since the PhononWorkChain ran smoothly, I suggest you read about the caching mechanism of AiiDA. It will save you lots of resources: it is basically a restart mechanism, so calculations that finished correctly will not be redone. Some care must be taken, however, with workflows that restart from previous calculations (that is, be careful not to remove "critical" folders). Since the PhononWorkChain and the DielectricWorkChain are independent, this mechanism is usually rather powerful and convenient. It is also very easy to activate (see the aiida-core docs). It just requires a bit of practice to understand the logic (you will also notice that it is rather sensitive to any change you make in the inputs).
For HDF5, fortunately, nothing changes at the AiiDA level.