githubEdit

Slurm

Getting Started with Slurm in ExCL with best practice recommendations.

Getting Started with Slurm

For general Slurm use and reference, see Job Submission with Slurm - CADES User Documentationarrow-up-right, CADES Cheat Sheetarrow-up-right, and Cheat Sheet | ExCL User Docsarrow-up-right.

sbatch Template

Here is a recommended Slurm sbatch script template:

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=slurm-test-out.txt
#SBATCH --error=slurm-test-err.txt
#SBATCH --partition=compute
#SBATCH --nodelist=justify

GIT_ROOT=$(git rev-parse --show-toplevel)
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

echo Started: $(date)
echo Host: $(hostname)
echo Path: $(pwd)
echo --------------------------------------------------------------------------------

### Setup Environment

### Run Command
echo Run Task

echo --------------------------------------------------------------------------------
echo Finished: $(date)

Using the Preemptable GPU Queue (nvidia-long)

The nvidia-long queue provides opportunistic GPU access for long-running or flexible workloads. Jobs in this queue may be preempted when higher-priority work arrives.

Preemption allows better overall cluster utilization, but your job must be written to tolerate interruption.


What “Preemptable” Means

Jobs submitted to nvidia-long:

  • Run when GPUs are idle

  • May be stopped at any time

  • Are requeued instead of canceled only if you request it

  • May restart multiple times

The normal GPU queue (nvidia) is not preempted.


Important Cluster Policy

The cluster is configured with:

What this means for users

  • Preemption is enabled

  • Preempted jobs are eligible to be requeued

  • Jobs are NOT requeueable by default

  • You must explicitly request requeue behavior

If you do not request requeue:

  • Your job will be canceled when preempted

  • Partial work will be lost


Submitting a Preemptable Job

Basic submission

⚠️ This job will be canceled on preemption unless you also add --requeue.


Making Your Job Requeueable (REQUIRED)

To allow your job to be restarted after preemption, you must include:

Example job script


When a job is preempted, Slurm sends a SIGTERM signal before stopping it.

You should:

  • Catch the signal

  • Save checkpoints

  • Exit cleanly

Example with signal handling

This gives your job 60 seconds to checkpoint before requeue.


How Requeued Jobs Behave

When a requeueable job is preempted:

  • It transitions to PENDING (BeginTime)

  • Slurm may delay restart briefly to avoid thrashing

  • scontrol show job will show:

    • Restarts=1 (or higher)

    • Requeue=1

This is expected behavior.


Monitoring Preemptable Jobs

Check job status

See restart count

View job history


When to Use nvidia-long

Good fits

  • Parameter sweeps

  • Model training with checkpoints

  • Monte Carlo workloads

  • Long experiments without strict deadlines

Not recommended

  • Jobs that cannot restart

  • Short, latency-sensitive runs

  • Jobs without checkpointing


Summary

Queue

Preemptable

Requires --requeue

Intended Use

nvidia

❌ No

N/A

Normal GPU jobs

nvidia-long

✅ Yes

✅ Yes

Opportunistic / long jobs

Key takeaways

  • nvidia-long jobs can be interrupted

  • Always use --requeue if you want your job restarted

  • ✔ Handle SIGTERM and checkpoint frequently

  • ✔ Expect jobs to restart multiple times

Last updated

Was this helpful?