Slurm
Getting Started with Slurm in ExCL with best practice recommendations.
Getting Started with Slurm
For general Slurm use and reference, see Job Submission with Slurm - CADES User Documentation, CADES Cheat Sheet, and Cheat Sheet | ExCL User Docs.
sbatch Template
Here is a recommended Slurm sbatch script template:
```bash
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=slurm-test-out.txt
#SBATCH --error=slurm-test-err.txt
#SBATCH --partition=compute
#SBATCH --nodelist=justify

# Useful paths: the root of the current git repository and the directory containing this script.
GIT_ROOT=$(git rev-parse --show-toplevel)
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

echo "Started: $(date)"
echo "Host: $(hostname)"
echo "Path: $(pwd)"
echo --------------------------------------------------------------------------------

### Setup Environment

### Run Command
echo "Run Task"

echo --------------------------------------------------------------------------------
echo "Finished: $(date)"
```

Using the Preemptable GPU Queue (nvidia-long)
The nvidia-long queue provides opportunistic GPU access for long-running or flexible workloads. Jobs in this queue may be preempted when higher-priority work arrives.
Preemption allows better overall cluster utilization, but your job must be written to tolerate interruption.
What “Preemptable” Means
Jobs submitted to nvidia-long:
Run when GPUs are idle
May be stopped at any time
Are requeued instead of canceled, but only if you explicitly request it
May restart multiple times
The normal GPU queue (nvidia) is not preempted.
Important Cluster Policy
The cluster is configured so that:
Preemption is enabled
Preempted jobs are eligible to be requeued
What this means for users
Jobs are NOT requeueable by default
You must explicitly request requeue behavior
If you do not request requeue:
Your job will be canceled when preempted
Partial work will be lost
Submitting a Preemptable Job
Basic submission
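A minimal sketch of a submission to the preemptable queue; job.sh is a placeholder for your batch script, and you may also need to request GPUs (for example with --gres) depending on the node configuration:

```bash
sbatch --partition=nvidia-long job.sh
```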
⚠️ This job will be canceled on preemption unless you also add --requeue.
Making Your Job Requeueable (REQUIRED)
To allow your job to be restarted after preemption, you must include:
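```bash
#SBATCH --requeue
```

The same flag can instead be passed on the command line, e.g. sbatch --requeue job.sh.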
Example job script
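A sketch of a requeueable job script; the GPU request (--gres=gpu:1), the checkpoint directory, and the train.py command are placeholders to adapt to your workload:

```bash
#!/bin/bash
#SBATCH --job-name=preempt-train
#SBATCH --partition=nvidia-long
#SBATCH --requeue
#SBATCH --gres=gpu:1          # placeholder GPU request; adjust to your needs
#SBATCH --output=slurm-%j.out

echo "Started: $(date) on $(hostname)"

# Placeholder workload: should resume from the latest checkpoint if one exists.
python train.py --checkpoint-dir checkpoints/

echo "Finished: $(date)"
```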
Handling Preemption Gracefully (Strongly Recommended)
When a job is preempted, Slurm sends a SIGTERM signal before stopping it.
You should:
Catch the signal
Save checkpoints
Exit cleanly
Example with signal handling
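A sketch, assuming the partition allows roughly 60 seconds of grace time after SIGTERM (see below) and that train.py is a placeholder for your checkpoint-aware workload:

```bash
#!/bin/bash
#SBATCH --job-name=preempt-train
#SBATCH --partition=nvidia-long
#SBATCH --requeue
#SBATCH --output=slurm-%j.out

# On preemption, Slurm sends SIGTERM: save state and exit cleanly so the
# requeued job can resume from the last checkpoint.
cleanup() {
    echo "Caught SIGTERM at $(date); writing checkpoint marker"
    touch checkpoints/PREEMPTED   # placeholder: trigger your real checkpoint logic here
    exit 0
}
trap cleanup SIGTERM

mkdir -p checkpoints

# Run the workload in the background so the shell can react to SIGTERM promptly.
python train.py --checkpoint-dir checkpoints/ &
wait $!
```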
This gives your job 60 seconds to checkpoint before requeue.
How Requeued Jobs Behave
When a requeueable job is preempted:
It transitions to PENDING (BeginTime)
Slurm may delay the restart briefly to avoid thrashing
scontrol show job will show Restarts=1 (or higher) and Requeue=1
This is expected behavior.
Monitoring Preemptable Jobs
Check job status
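For example, to list your own jobs:

```bash
squeue -u $USER
```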
See restart count
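For example (replace <jobid> with your job ID):

```bash
scontrol show job <jobid> | grep -Eo 'Restarts=[0-9]+|Requeue=[0-9]+'
```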
View job history
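For example, with sacct (requires job accounting to be enabled):

```bash
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,ExitCode
```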
When to Use nvidia-long
Good fits
Parameter sweeps
Model training with checkpoints
Monte Carlo workloads
Long experiments without strict deadlines
Not recommended
Jobs that cannot restart
Short, latency-sensitive runs
Jobs without checkpointing
Summary
| Queue | Preemptable | Requires --requeue | Intended Use |
| --- | --- | --- | --- |
| nvidia | ❌ No | N/A | Normal GPU jobs |
| nvidia-long | ✅ Yes | ✅ Yes | Opportunistic / long jobs |
Key takeaways
✔ nvidia-long jobs can be interrupted
✔ Always use --requeue if you want your job restarted
✔ Handle SIGTERM and checkpoint frequently
✔ Expect jobs to restart multiple times