Dear all,
I tried to run /local/th1/DFT/fleur_MPI_MaXR5_th1 on a node of the th1 partition and got a Segmentation Fault and an MPI_Abort. The node log does not show any out-of-memory errors for this run. This is the Slurm error log:
I/O warning : failed to load external entity "relax.xml"
Signal 11 detected on PE: 9
This might be due to either:
- A bug
- Your job running out of memory
- Your job got killed externally (e.g. no cpu-time left)
- ....
Please check and report if you believe you found a bug
Abort(0) on node 9 (rank 9 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 9
Signal 11 detected on PE: 6
This might be due to either:
- A bug
- Your job running out of memory
- Your job got killed externally (e.g. no cpu-time left)
- ....
Please check and report if you believe you found a bug
Abort(0) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 6
Signal 11 detected on PE: 7
This might be due to either:
- A bug
- Your job running out of memory
- Your job got killed externally (e.g. no cpu-time left)
- ....
Please check and report if you believe you found a bug
Abort(0) on node 7 (rank 7 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 7
Signal 11 detected on PE: 8
This might be due to either:
- A bug
- Your job running out of memory
- Your job got killed externally (e.g. no cpu-time left)
- ....
Please check and report if you believe you found a bug
Abort(0) on node 8 (rank 8 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 8
Signal 11 detected on PE: 10
This might be due to either:
- A bug
- Your job running out of memory
- Your job got killed externally (e.g. no cpu-time left)
- ....
Please check and report if you believe you found a bug
Abort(0) on node 10 (rank 10 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 10
Signal 11 detected on PE: 11
This might be due to either:
- A bug
- Your job running out of memory
- Your job got killed externally (e.g. no cpu-time left)
- ....
Please check and report if you believe you found a bug
Abort(0) on node 11 (rank 11 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 11
Signal
Signal
Signal
Signal
Signal
Signal
slurmstepd: error: *** STEP 3218157.0 ON iffcluster0702 CANCELLED AT 2021-10-27T10:26:05 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: iffcluster0702: task 7: Killed
srun: error: iffcluster0702: tasks 1-2,4,6,10-11: Killed
srun: error: iffcluster0702: tasks 0,3,5,8-9: Killed
For the primitive cell (22 atoms), it works pretty well, but for the supercell (88 atoms) I get this error.
Could you please tell me how I can solve it?
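
For reference, the job is submitted with a batch script along these lines (a minimal sketch: only the partition, the single node, the 12 MPI tasks visible in the log, and the executable path are taken from the run above; the time limit and job name are assumptions):

    #!/bin/bash
    #SBATCH --partition=th1        # th1 partition, as above
    #SBATCH --nodes=1              # single node (iffcluster0702 in this run)
    #SBATCH --ntasks=12            # 12 MPI ranks (tasks 0-11 in the log)
    #SBATCH --time=24:00:00        # assumed time limit
    #SBATCH --job-name=fleur_88    # assumed job name

    srun /local/th1/DFT/fleur_MPI_MaXR5_th1

The same kind of script runs the 22-atom primitive cell without problems; only the 88-atom supercell crashes.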
Best,
Mohammad