Job Signals

Slurm can send a signal to a running job a configurable time before its time limit is reached. This can be used to save the current state of a calculation and to copy checkpoint data from local storage to the home or scratch file system, so that the results are not lost when the job is terminated.

Sending the Signal

The signal is controlled with the --signal parameter of sbatch. The syntax is as follows:

--signal=B:<sig_num>@<sig_time>

This sends the signal with the number <sig_num> to the job <sig_time> seconds before the time limit is reached. All options are described in the sbatch manpage. As a concrete example:

--signal=B:12@600

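This sends signal 12 (aka SIGUSR2, most likely not used in your program) to the batch shell of the job 10 minutes before the job runs into the time limit. The B: prefix means that only the batch shell receives the signal, not the processes started from it; without the prefix, Slurm signals the job steps but not the batch shell. According to the sbatch manpage, the signal may arrive up to 60 seconds earlier than specified.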
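Signal numbers depend on the platform, so it can be worth checking which number SIGUSR2 has on the machine in question. The following is a small sketch using the bash builtin kill -l, which translates between signal names and numbers. The sbatch manpage also allows a signal name in place of the number, as shown in the final comment.

```shell
#!/bin/bash
# Translate between signal names and numbers with the bash builtin `kill -l`.
sig_num=$(kill -l USR2)          # number of SIGUSR2 on this machine (12 on x86 Linux)
sig_name=$(kill -l "${sig_num}") # and back from the number to the name
echo "SIGUSR2 has number ${sig_num} here (name: ${sig_name})"

# sbatch also accepts the name, which avoids hard-coding the number:
# #SBATCH --signal=B:USR2@600
```

If a name is used in --signal, using the same name in the trap command keeps the two consistent.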

Trapping the Signal

Sending a signal is not enough; the job also needs to know what to do with it. The easiest way to do that is to use the shell's trap command, which defines the steps to take when the signal is received, for example:

trap 'mkdir -p ${HOME}/job_${SLURM_JOBID}; cp -af ${TMP_LOCAL}/* ${HOME}/job_${SLURM_JOBID}/; exit 12' 12

This traps signal 12 and runs the given commands, which create a folder named after the job ID in the home directory and copy all files from the local disk (located at $TMP_LOCAL) into it. Some more examples of using signals and traps can be found here.
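The behaviour of the trap can be tried out locally without Slurm. The following is a minimal sketch, not the exact job script: demo_src and demo_dst are made-up stand-ins for $TMP_LOCAL and the job folder in the home directory, the script sends SIGUSR2 to itself instead of waiting for Slurm, and the handler sets a variable rather than calling exit so that the result can be inspected afterwards.

```shell
#!/bin/bash
# Local demonstration of a trap handler; no Slurm needed.
# demo_src and demo_dst are hypothetical stand-ins for $TMP_LOCAL
# and ${HOME}/job_${SLURM_JOBID}.
demo_src=$(mktemp -d)
demo_dst=$(mktemp -d)/job_demo
echo "partial result" > "${demo_src}/checkpoint.dat"

# Single quotes delay variable expansion until the handler actually runs.
trap 'mkdir -p "${demo_dst}"; cp -af "${demo_src}"/* "${demo_dst}"/; handled=yes' USR2

# Send the signal to this shell, as Slurm would do shortly before the time limit.
kill -USR2 $$

echo "handler ran: ${handled:-no}"
ls "${demo_dst}"
```

Quoting the variables inside the handler, as done here, protects against paths that contain spaces.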

If you have a multi-node job, you will have to use srun to run the copy command on all nodes of the job:

trap 'mkdir -p ${HOME}/job_${SLURM_JOBID}; srun -n ${SLURM_JOB_NUM_NODES} --ntasks-per-node=1 cp -af ${TMP_LOCAL}/* ${HOME}/job_${SLURM_JOBID}/; exit 12' 12
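One thing to keep in mind with the command above: the wildcard ${TMP_LOCAL}/* is expanded by the batch shell on the first node before srun starts, so every node is asked to copy the file names found there, and files with the same name from different nodes end up at the same destination. The following is a sketch of one possible way around this, under some assumptions: copy_from_all_nodes is a hypothetical helper name, the per-node subfolder is a design choice made for this illustration, and TMP_LOCAL is assumed to be an exported environment variable. SLURMD_NODENAME is set by Slurm for each task.

```shell
#!/bin/bash
# Sketch of a multi-node copy handler (an assumption, not an official recipe).
# copy_from_all_nodes is a hypothetical helper name.
copy_from_all_nodes() {
    local dest="${HOME}/job_${SLURM_JOBID}"
    mkdir -p "${dest}"
    # The inner script is single-quoted, so ${TMP_LOCAL}/* and ${SLURMD_NODENAME}
    # are expanded by the bash started on each node, not by the batch shell.
    srun --ntasks="${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 \
        bash -c 'mkdir -p "$1/${SLURMD_NODENAME}" && cp -af "${TMP_LOCAL}"/* "$1/${SLURMD_NODENAME}"/' _ "${dest}"
}

# Run the helper when SIGUSR2 (signal 12) arrives, then end the job script.
trap 'copy_from_all_nodes; exit 12' USR2
```

With this variant, the data of each node lands in its own subfolder of the job folder.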

One more modification is necessary: bash does not execute the trap until the currently running foreground command has finished. As this is not intended in this case (the time limit is approaching), the calculation has to be started in the background, for example:

./long_calculation.py &
wait

The Python program runs in the background, and wait waits for it to finish while still allowing the trap to be executed.
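The effect of the background start can be observed locally. The following is a minimal sketch without Slurm: sleep 30 is a stand-in for the long calculation, and a subshell plays the role of Slurm by sending SIGUSR2 to the script after one second. Bash documents that wait returns immediately with an exit status greater than 128 when a trapped signal arrives, after which the trap is executed.

```shell
#!/bin/bash
# Local demonstration: the trap fires promptly because the shell sits in `wait`.
# `sleep 30` is a stand-in for the long calculation.
trap 'echo "signal received after ${SECONDS}s"; received=yes' USR2

sleep 30 &            # the "calculation" runs in the background
calc_pid=$!

# Simulate Slurm: deliver SIGUSR2 to this shell after 1 second.
( sleep 1; kill -USR2 $$ ) &

wait "${calc_pid}"    # returns early (status > 128) when the trapped signal arrives
wait_status=$?

kill "${calc_pid}" 2>/dev/null   # clean up the stand-in calculation
echo "wait returned ${wait_status} after ${SECONDS}s"
```

If sleep 30 were started in the foreground instead, the handler would only run once the 30 seconds had passed. Note also that the calculation keeps running while the handler does its work, which is why the sketch stops it explicitly.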

Example

#!/bin/bash
#SBATCH -p medium
#SBATCH -t 24:00:00
#SBATCH -c 10
#SBATCH -N 1
#SBATCH --signal=B:12@600
 
module load python
cd $TMP_LOCAL
 
trap 'mkdir -p ${HOME}/job_${SLURM_JOBID}; cp -af ${TMP_LOCAL}/* ${HOME}/job_${SLURM_JOBID}/; exit 12' 12
 
./big_calculation.py &
wait