SLURM

SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Getting Started

You may want to check out both the quickstart guide and the man pages before diving into this guide.

Choosing a Partition

Server: headnode.rit.albany.edu

batch is the default partition. There are other partitions, but they are reserved and should not be used without permission. We aim to accommodate as many different operational modes as possible, so please contact us if you have special needs that don't fit within the current operating bounds.

Please never use a node listed in a partition outside of SLURM job control.

Nodes Available in General Cluster (batch)

Partitions Available

sinfo

sinfo gives you an overview of the "partitions" that exist for running jobs. Here is the output of the command on the head node.

[as845383@headnode ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
snow         up 14-00:00:0     16   idle snow-[01-16]
batch*       up 14-00:00:0     14   idle cc1-[01-08],rhea-[01-06]
matrix       up   infinite      1  down* matrix-05
matrix       up   infinite      7   idle matrix-[01-04,06-08]


Here sinfo is telling us that we have multiple partitions in various states of use, the job time limit for each queue, and the default partition (the one marked with a *, batch). The partitions have been broken down in such a way as to maximize the ability of people to run a wide variety of jobs without interfering with each other.

Because processing resources are shared, it is also very handy to see the CPU breakdown by partition.

-bash-3.2$ sinfo -o "%P %C"
PARTITION CPUS(A/I/O/T)
snow 0/512/0/512
batch* 0/208/0/208
matrix 0/56/8/64

The CPUS column is in the form allocated/idle/other/total (A/I/O/T). The important number for most users is the second one, since that is the number of idle CPUs in each partition.
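
If you want the same information broken down per node rather than per partition, sinfo can do that as well; the exact columns vary with the SLURM version, but the node-oriented long listing is a good starting point:

sinfo -N -l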

squeue

To see what is currently processing, or waiting to be processed, we can take a look at squeue.

bash-3.2$ squeue
  JOBID PARTITION  NAME     USER    ST       TIME  NODES NODELIST(REASON)
  11411    batch vasp_job yx152122  PD       0:00      3 (Resources)
 11623     batch vasp_job yx152122   R    3:32:22      1 hermes-05
 11410     batch vasp_job yx152122   R   17:13:46      4 genesis-[02-05]
 11403     batch vasp_job yx152122   R   21:36:12      4 genesis-[07-10]
 11399     batch vasp_job yx152122   R   21:38:29      3 hermes-[02,04,08]
 11396     batch   sbatch   ig4895   R   21:52:58      1 genesis-01
 11397     batch   sbatch   ig4895   R   21:52:58      1 genesis-01
 11272     batch vasp_job yx152122   R 1-19:06:20      1 hermes-03
 11230     batch   sbatch   ig4895   R 4-00:26:42      1 genesis-11
 11231     batch   sbatch   ig4895   R 4-00:26:42      1 genesis-12


This shows that we have 9 jobs running and one job blocked waiting for "Resources". If there is more than one job waiting for a partition, some may be waiting on "Priority". Anything in the parentheses is a message indicating why the job is not yet running.
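
When the queue is busy it is usually easier to look at only your own jobs; squeue's -u option filters by username (substitute your own):

squeue -u as845383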


The srun section below shows three examples with the most basic options. In the first we run the `hostname` command on two CPUs in the default partition, in the second we run it on two nodes, and in the third we run it on four CPUs in the batch partition.

NOTE: While it might seem possible to do `srun -n 10 mpirun ...`, it does not work as expected and will cause problems. Please only launch MPI jobs through salloc or sbatch.

salloc

The next step up in complexity is salloc, which allows us to run a multi-step job interactively. This is most useful for testing a more complicated job structure before submitting it to a larger queue as a batch job.

bash-3.2$ salloc -n4 sh
salloc: Granted job allocation 125
sh-3.2$ hostname
headnode.rit.albany.edu
sh-3.2$ echo "Processing"
Processing
sh-3.2$ srun hostname
cc1-01
cc1-01
cc1-01
cc1-01
sh-3.2$ sleep 10
sh-3.2$ srun -l sleep 10
sh-3.2$ exit
exit
salloc: Relinquishing job allocation 125

Here we can start to see some more complicated job semantics. This shell runs on the submit node, but when we call srun we are running on our partition allocation. We can relinquish the allocation by exiting the sub-shell.

You can also use salloc to launch a single MPI job interactively.

salloc -n4 mpirun /network/rit/lab/ritstaff/intelsoftware/mrbayes-3.1.2_p/mb test.txt
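
If you want an interactive shell on a compute node itself, rather than on the submit node, srun's --pty option generally works (a minimal sketch; behavior can vary with the SLURM version installed):

srun -n1 -p batch --pty bash

Exiting the shell releases the allocation.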

sbatch

So far we have covered the basics of exploring SLURM interactively, but what most people are interested in is running jobs in bulk, non-interactively. sbatch is the interface for submitting these jobs. sbatch takes virtually the same arguments as srun and salloc, but expects the last argument to be a shell script that drives the batch job.

bash-3.2$ cat test_job
#!/bin/sh 
echo "This line gets run once at the launching node"
hostname
echo "These get launched all all allocated CPUs"
srun -l hostname
srun -l sleep 60
echo "All done"
bash-3.2$ sbatch -n2 test_job
Submitted batch job 135

By default the input is /dev/null and the output is directed to slurm-<jobid>.out; in this case it would be slurm-135.out.

bash-3.2$ cat slurm-135.out
This line gets run once at the launching node
hermes-01.rit.albany.edu
These get launched on all allocated CPUs
0: hermes-01.rit.albany.edu
1: hermes-01.rit.albany.edu
All done
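
If you would rather name the output files yourself, the --output and --error options accept a filename pattern in which %j expands to the job ID, for example:

sbatch -n2 --output=test_job-%j.out --error=test_job-%j.err test_job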

sbatch wrapping

For many jobs, writing a whole shell script may be unnecessary overhead just to get started. For jobs which are simple one-liners you may be able to use sbatch wrapping to get up and running quickly.

sbatch -p batch -n4 --wrap="mpirun /network/rit/lab/ritstaff/intelsoftware/mrbayes-3.1.2_p/mb test.txt"
Submitted batch job 139
bash-3.2$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139     batch   sbatch   ew2193   R       3:22      1 hermes-01

Wrapping is appropriate when your shell script would have been equivalent to the following.

#!/bin/sh
#
<MY COMMAND>

Serial Jobs

Now that we have the basics down, we can move on to submitting serial jobs in bulk. I will use an example application called dnetc.

-bash-3.2$ more dnetc.sh
#!/bin/sh
#SBATCH --mem-per-cpu=30 --output=output/slurm-%j.out --nice=10000 --time=07:00:00
#
srun -n1 ./dnetc -n 1 -numcpu 0 -priority 9
#
exit 0  # dnetc likes to exit with non-zero exit codes, which SLURM reports as failures, so I return 0 to indicate success.

Here is my single-run SLURM script. I have embedded the sbatch options so that I don't need to type them on the command line every time. `--mem-per-cpu` is the number of megabytes of RAM that the job consumes per slot, `--output` sends the stdout to another location so I don't clutter my working directory, `--nice=10000` means that my jobs will be last in the queue, and `--time` specifies that my job will take no longer than 7 hours to run.

To submit this job multiple times you can use either bash or tcsh.

bash

-bash-3.2$ for num in {1..10} ; do sbatch -p batch dnetc.sh ; done

tcsh

[ew2193@hermes ~]$ repeat 10 sbatch -p batch dnetc.sh
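
Depending on the SLURM version installed, job arrays may be available as an alternative to a submission loop. This is only a sketch of the standard --array syntax; check that your SLURM version supports it:

sbatch -p batch --array=1-10 dnetc.sh

Each array task gets its own task ID (available in the script as $SLURM_ARRAY_TASK_ID) and its own output file.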

parallel jobs / mpi

SLURM has full support for many MPI implementations, and RIT has tested OpenMPI with great success. OpenMPI is installed on all cluster nodes and automatically picks up your SLURM allocation. LAM is supported but we have not yet tested it; please check the SLURM website for details on running jobs with other MPI implementations.

There are two ways to run MPI jobs: interactive or batch.

bash-3.2$ salloc -n 4
salloc: Granted job allocation 147
bash-3.2$ mpirun -nolocal /network/rit/lab/ritstaff/intelsoftware/mrbayes-3.1.2_p/mb test.txt
...
exit
exit
salloc: Relinquishing job allocation 147

NOTE: Using -nolocal prevents the MPI job from also launching on the head node when using salloc, but you should not use it in scripts submitted with sbatch, or you will have allocated resources that do not get used.

With batch you can either develop a script as shown above or launch it as a wrap.

sbatch -n4 -p batch --wrap="mpirun /network/rit/lab/ritstaff/intelsoftware/mrbayes-3.1.2_p/mb test.txt"
Submitted batch job 146
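
If you prefer a script to a wrap, a minimal MPI batch script looks something like the following. This is a sketch: the task count and the 12-hour time limit are placeholders, and the mb path is the same MrBayes example used above.

#!/bin/sh
#SBATCH -n 4 -p batch --time=12:00:00
# OpenMPI detects the SLURM allocation automatically, so no host list is needed
mpirun /network/rit/lab/ritstaff/intelsoftware/mrbayes-3.1.2_p/mb test.txt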

srun

srun allows you to run a job on the partition. With srun you are limited to a single "step" job running interactively.

[as845383@headnode ~]$ srun -l -n2 hostname
1: cc1-01
0: cc1-01
[as845383@headnode ~]$ srun -l -N2 hostname
0: cc1-01
1: cc1-02
[as845383@headnode ~]$ srun -l -n4 -p batch hostname
1: cc1-01
3: cc1-01
2: cc1-01
0: cc1-01

SMP jobs

In addition to requesting how many "tasks" you want, you can also tell SLURM how many CPUs each task requires. This way your SMP (multi-threaded) applications will be properly allocated.

srun -n 1 -c 4 my4threadapp

This requests a slot for one task that requires 4 threads of execution. Since it is a single task, it will always be allocated on one node.
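
The same request works in a batch script. Here is a minimal sketch, assuming an OpenMP-style program with the same hypothetical name as above:

#!/bin/sh
#SBATCH -n 1 -c 4
# SLURM sets SLURM_CPUS_PER_TASK from the -c value; use it to size the thread pool
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my4threadapp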

Other possibilities

We have only just scratched the surface of what is possible. The scheduler has ways of launching jobs at a future time, making jobs dependent on other jobs, or even running singleton jobs. If you have questions or concerns please contact us to go over the options.
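
For reference, these capabilities are exposed through sbatch options such as --begin and --dependency. The exact syntax is in the sbatch man page; the general form looks like this (the job ID and script names are just examples):

# start no earlier than one hour from now
sbatch --begin=now+1hour dnetc.sh
# start only after job 11411 completes successfully
sbatch --dependency=afterok:11411 dnetc.sh
# allow only one job with this name to run at a time
sbatch --job-name=dnetc --dependency=singleton dnetc.sh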

Time Defaults

      -t, --time=


By default all jobs get a 14-day window in which to run. This limit is set by the queue and can be shortened by the user; when it expires, the job is killed. If you know that your job can run and complete in less than 14 days, setting a shorter limit will allow it to be scheduled more efficiently. You can also adjust the limit on a job that has already been submitted:

scontrol update jobid=<jobid> TimeLimit=+1:00:00
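
To set the limit at submission time instead, pass --time (or -t) on the sbatch command line, or embed it in your script as the dnetc example above does:

sbatch -p batch --time=07:00:00 dnetc.sh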

Maintenance Windows

Time limits are important around maintenance windows. If your job has a time limit of 14 days and there is a maintenance window within the next 14 days, your job will not run: SLURM looks ahead, sees that you would need to vacate the resource before the job could complete, and does not start it. If your time limit is set properly, SLURM can schedule your job so that it runs and finishes during the pre-shutdown period.

Discovering Job Runtime

If you have been running jobs without paying attention to how long they take, you can discover how long your previous jobs ran with the sacct command. (To list detailed information for a single job, which is useful for troubleshooting, run scontrol show jobid -dd <jobid>.) The example below shows all of my completed jobs since January 1st, 2011, sorted by elapsed time.

[root@hermes ~]# sacct -u ew2193 --start 1/1/11 -o JobID,Elapsed -X -s cd --noheader | sort -k 2
     47908   00:00:00 
     47909   00:00:00 
     47910   00:00:00 
     47842   00:00:01 
     47843   00:00:01 
     47847   00:00:01 
....
     56668   13:25:06 
     37017   13:26:25 
     35662   13:30:03 
     37240   13:34:13 
     35658   13:40:13 
     78893   13:41:17 

In my case I can safely add --time=14:00:00 (14 hours) to my job submissions because I have never exceeded that time allocation.