Slurm

Getting Started with Slurm in ExCL with best practice recommendations.

Getting Started with Slurm

For general Slurm use and reference, see Job Submission with Slurm - CADES User Documentation, CADES Cheat Sheet, and Cheat Sheet | ExCL User Docs.

sbatch Template

Here is a recommended Slurm sbatch script template:

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=slurm-test-out.txt
#SBATCH --error=slurm-test-err.txt
#SBATCH --partition=compute
#SBATCH --nodelist=justify

GIT_ROOT=$(git rev-parse --show-toplevel)
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

echo Started: $(date)
echo Host: $(hostname)
echo Path: $(pwd)
echo --------------------------------------------------------------------------------

### Setup Environment

### Run Command
echo Run Task

echo --------------------------------------------------------------------------------
echo Finished: $(date)

Using the Preemptable GPU Queue (`nvidia-long`)

The nvidia-long queue provides opportunistic GPU access for long-running or flexible workloads. Jobs in this queue may be preempted when higher-priority work arrives.

Preemption allows better overall cluster utilization, but your job must be written to tolerate interruption.

What “Preemptable” Means

Jobs submitted to nvidia-long:

Run when GPUs are idle
May be stopped at any time
Are requeued instead of canceled only if you request it
May restart multiple times

The normal GPU queue (nvidia) is not preempted.

Important Cluster Policy

The cluster is configured with:

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
JobRequeue=0

What this means for users

Preemption is enabled
Preempted jobs are eligible to be requeued
Jobs are NOT requeueable by default
You must explicitly request requeue behavior

If you do not request requeue:

Your job will be canceled when preempted
Partial work will be lost

Submitting a Preemptable Job

Basic submission

sbatch -p nvidia-long my_job.sh

⚠️ This job will be canceled on preemption unless you also add --requeue.

Making Your Job Requeueable (REQUIRED)

To allow your job to be restarted after preemption, you must include:

#SBATCH --requeue

Example job script

#!/bin/bash
#SBATCH -p nvidia-long
#SBATCH --requeue
#SBATCH --job-name=my-long-job
#SBATCH --time=7-00:00:00
#SBATCH --gres=gpu:1

./run_my_workload.sh

Handling Preemption Gracefully (Strongly Recommended)

When a job is preempted, Slurm sends a SIGTERM signal before stopping it.

You should:

Catch the signal
Save checkpoints
Exit cleanly

Example with signal handling

#!/bin/bash
#SBATCH -p nvidia-long
#SBATCH --requeue
#SBATCH --signal=B:TERM@60   # 60 seconds warning

checkpoint() {
    echo "Preempted – saving state..."
    ./save_checkpoint.sh
    exit 0
}

trap checkpoint SIGTERM

./run_my_workload.sh

This gives your job 60 seconds to checkpoint before requeue.

How Requeued Jobs Behave

When a requeueable job is preempted:

It transitions to PENDING (BeginTime)
Slurm may delay restart briefly to avoid thrashing
scontrol show job will show:
- Restarts=1 (or higher)
- Requeue=1

This is expected behavior.

Monitoring Preemptable Jobs

Check job status

squeue -j <jobid>

See restart count

scontrol show job <jobid> | grep -i restart

View job history

sacct -j <jobid> --format=JobID,State,Reason,Elapsed

When to Use `nvidia-long`

Good fits

Parameter sweeps
Model training with checkpoints
Monte Carlo workloads
Long experiments without strict deadlines

Not recommended

Jobs that cannot restart
Short, latency-sensitive runs
Jobs without checkpointing

Summary

Queue

Preemptable

Requires --requeue

Intended Use

nvidia

❌ No

N/A

Normal GPU jobs

nvidia-long

✅ Yes

Opportunistic / long jobs

Key takeaways

✔ nvidia-long jobs can be interrupted
✔ Always use --requeue if you want your job restarted
✔ Handle SIGTERM and checkpoint frequently
✔ Expect jobs to restart multiple times

PreviousSiemens EDA NextThinLinc

Last updated 1 month ago

Was this helpful?

hashtagGetting Started with Slurm

hashtagsbatch Template

hashtagUsing the Preemptable GPU Queue (nvidia-long)

hashtagWhat “Preemptable” Means

hashtagImportant Cluster Policy

hashtagWhat this means for users

hashtagSubmitting a Preemptable Job

hashtagBasic submission

hashtagMaking Your Job Requeueable (REQUIRED)

hashtagExample job script

hashtagHandling Preemption Gracefully (Strongly Recommended)

hashtagExample with signal handling

hashtagHow Requeued Jobs Behave

hashtagMonitoring Preemptable Jobs

hashtagCheck job status

hashtagSee restart count

hashtagView job history

hashtagWhen to Use nvidia-long

hashtagSummary

hashtagKey takeaways