Slurm Basics
Slurm is an open-source, fault-tolerant, and highly scalable system for cluster management and job scheduling.
Slurm's main task is to manage the cluster workload, and to do so it has three key functions. First, it grants users access to resources on compute nodes for a specified period so that they can run jobs there. Second, it provides a framework for starting, executing, and monitoring jobs on the set of allocated nodes. Finally, it manages the queue of jobs that are waiting for resources to be released before they can be executed.
To use a cluster via Slurm, the user must first gain access to the login node of the selected cluster. The login node is a computer that is directly connected to the system, is properly configured to communicate with the control daemon, and has the command-line tools installed. The user connects to the login node via SSH, logging in to their user account on the selected cluster and gaining access to a remote shell on the login node, from where they can submit jobs that are then taken over by Slurm.
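For example, assuming a hypothetical user name and login node address:

```bash
# Open an SSH session to the cluster's login node (user and host are placeholders)
ssh jdoe@login.cluster.example.org
```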
Slurm uses four basic steps to manage CPU resources for a job:
- Selection of nodes
- Allocation of CPUs from selected nodes
- Distribution of tasks to selected nodes
- Optional distribution and binding of tasks to allocated CPUs within a node (task affinity).
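As a rough illustration of how these steps map onto command options, the sketch below requests nodes and tasks explicitly and enables task binding; the program name and values are placeholders:

```bash
# Hypothetical srun call: 2 nodes, 8 tasks in total;
# --distribution controls how tasks are spread across the selected nodes,
# --cpu-bind=cores binds each task to its allocated cores (task affinity).
srun --nodes=2 --ntasks=8 --distribution=block:cyclic --cpu-bind=cores ./my_program
```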
User & admin commands
The most important commands are:
- sinfo – check system status information (nodes and partitions).
- squeue – check the status of jobs in the queue.
- scancel – cancel a job in a queue or in progress.
- sview – see graphically displayed information about the status of jobs, partitions, and nodes which are managed by Slurm.
- scontrol – for cluster monitoring, administration, and configuration.
- sstat – show status of running jobs.
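A few typical invocations (the job ID 123456 is a placeholder):

```bash
sinfo                      # overview of partitions and node states
squeue -u $USER            # jobs belonging to the current user
sstat -j 123456            # resource usage of a running job
scontrol show job 123456   # detailed information about a job
scancel 123456             # cancel the job
```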
User commands
- srun – used to submit a job for execution or launch job steps in real time. A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation. srun has a wide range of options for specifying resource requirements, including the minimum and maximum number of nodes, the number of processors, specific nodes to use or avoid, and specific node characteristics (amount of memory required, required disk space, required features, etc.).
- salloc – used to allocate resources for a job in real time. Typically, this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
- sbatch – used to submit a batch script for later execution, when the requested resources become available. The script can contain multiple srun commands to launch job steps. When a job is submitted through sbatch, Slurm returns the ID of the job in the queue (see the example after this list).
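As a minimal sketch (the script and program names are placeholders), a batch script contains #SBATCH options followed by the commands to run:

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

srun ./my_program   # each srun line launches one job step on the allocated resources
```

Submitting the script with sbatch job.sh returns the assigned job ID (e.g. "Submitted batch job 123456").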
Admin commands
- sacctmgr – to manage the database, cluster data, users, accounts, and partitions.
- sreport – display information from the accounting database about jobs, users, and clusters.
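For instance (the dates are placeholders):

```bash
sacctmgr show user                                            # users known to the accounting database
sacctmgr show account                                         # accounts/projects
sreport cluster utilization start=2024-01-01 end=2024-02-01   # cluster usage report for January 2024
```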
Other useful commands
- sprio – used to display a detailed view of the components that affect a job's priority.
- sacct – check accounting data for current and completed jobs and job steps, and generate reports.
- sattach – attach standard input, output, and error streams, as well as signal capabilities, to a currently running job or job step.
- sbcast, sgather – used to transfer a file from local disk to the local disks of the nodes allocated to a job (sbcast) and the other way around (sgather).
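Typical uses, again with a placeholder job ID:

```bash
sprio -j 123456                                        # priority components of a pending job
sacct -j 123456 --format=JobID,JobName,State,Elapsed   # accounting data for a job
sattach 123456.0                                       # attach to job step 0 of a running job
sbcast input.dat /tmp/input.dat                        # inside a job script: copy a file to every allocated node
```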
Slurm implements a documented API for all the listed functions.
The salloc command is used to allocate resources (a set of nodes), possibly with some constraints (e.g. the number of processors per node), in order to perform operations on them. When salloc successfully obtains the requested allocation, it runs a user-specified command. When that command completes, salloc relinquishes the job allocation. The command can be any program the user wants to run; typical examples are xterm, a script containing srun commands, or even a standalone srun command. If no command is specified, salloc starts the user's default shell.
Although srun can be used within a salloc allocation in this way (as in the sketch below), it is more commonly used inside batch scripts submitted with sbatch.
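A sketch of such an interactive allocation (the resource values are placeholders):

```bash
salloc --nodes=2 --ntasks=8 --time=01:00:00   # request an allocation; salloc starts the default shell
srun ./my_program                             # launch a job step on the allocated nodes
exit                                          # leave the shell and release the allocation
```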
Options that can be attached to the srun, salloc, and sbatch commands
Sbatch option | Default value | Description |
---|---|---|
--nodes=<number>, -N <number> | 1 | Number of nodes for the allocation. |
--ntasks=<number>, -n <number> | 1 | Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given. |
--ntasks-per-node=<num> | 1 | Number of tasks per node. If the keyword is omitted, the default value is used. |
--cpus-per-task=<number>, -c <number> | 1 | Number of threads (logical cores) per task. Used for OpenMP or hybrid jobs. |
--mem=<size[units]>, --mem-per-cpu=<num[units]>, --mem-per-gpu=<num[units]> | - | Required amount of memory. Note: a memory size specification of zero grants the job access to all of the memory on each node. |
--mail-user=<email> | - | Email address for notifications. |
--output=<path>/<file pattern>, -o <path>/<file pattern> | slurm-%j.out (%j = JobID) | Standard output file. |
--error=<path>/<file pattern>, -e <path>/<file pattern> | slurm-%j.err (%j = JobID) | Standard error file. |
--time=<walltime>, -t <walltime> | partition dependent | Requested walltime limit for the job. |
--partition=<name>, -p <name> | none | Partition to run the job in. |
--constraint=<list>, -C <list> | none | Node features requested for the job. See the cluster configuration for available features. |
--job-name=<jobname>, -J <jobname> | job script's name | Job name. |
--account=<project_account>, -A <project_account> | none | Project that should be charged. |
--exclude=<nodelist>, -x <nodelist> | - | Exclude the specified nodes from the job allocation. |
--nodelist=<nodelist>, -w <nodelist> | - | Request the specified nodes for the job allocation (if necessary, additional nodes are added to fulfil the requested number of nodes). |
--requeue or --no-requeue | no-requeue | Specifies whether the batch job should be requeued after a node failure. Caution: if a job is requeued, the whole batch script is restarted from the beginning. |