Getting Work Done

Submit Batch Jobs

Batch jobs are useful when you have an application that can run unattended for long periods of time. You run a batch job by using the sbatch command with a batch file.

Writing a batch file

A batch file consists of two sections:

  1. Batch header - Communicates settings to Slurm that specify your Slurm account, the compute partition to submit the job to, the number of tasks to run, the amount of resources (CPU, GPU, and memory), and the task affinity.

  2. Shell script instructions

Below is a simple example batch file:

#!/bin/bash
#SBATCH --account=my-slurm-account
#SBATCH --partition=this-partition
#SBATCH --job-name=example_job_name
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --time=00:05:00
#SBATCH --output=serial_test_%j.log

hostname

The above batch file contains several directives that dictate how the job will be executed.

  • --account=my-slurm-account charges the job's resource usage against the "my-slurm-account" Slurm account.

  • --partition=this-partition requests that the job execute on the partition called "this-partition".

  • --job-name=example_job_name sets the job name.

  • --ntasks=1 advises the Slurm controller that job steps run within the allocation will launch a maximum of 1 task.

  • --ntasks-per-node=1 requests, when used by itself, that 1 task be invoked per node. When used with --ntasks, --ntasks-per-node is treated as the maximum count of tasks per node.

  • --gres=gpu:2 indicates that 2 GPUs are requested to execute this batch job.

  • --time=00:05:00 sets a wall-time limit of 5 minutes for the job allocation.

  • --output=serial_test_%j.log creates a file containing the batch script's stdout and stderr; %j is replaced with the job ID.

SchedMD’s sbatch documentation provides a more complete description of the sbatch command line interface and the available options for specifying resource requirements and task affinity.

Submitting a batch job

Batch jobs are submitted using the sbatch command.

sbatch example.batch

Once you have submitted your batch job, you can check its status with the squeue command. Since Fluid-Slurm-GCP is an autoscaling cluster, you may notice that your job sits in a configuring (CF) state for some time before starting. This happens because compute nodes are created on-the-fly to meet compute resource demands. This process can take anywhere from 30 seconds to 3 minutes.

squeue

Interactive Jobs

The interactive workflows described here use a combination of salloc and srun command line interfaces. It is highly recommended that you read through SchedMD's salloc documentation and srun documentation to understand how to reserve and release compute resources in addition to specifying task affinity and other resource-task bindings.

For all interactive workflows, you should be aware that you are charged for each second of allocated compute resources. It is best practice to set a wall-time when allocating resources. This practice helps avoid situations where you will be billed for idle resources you have reserved.

Allocate and Execute Workflow

With Slurm, you can allocate compute resources that are reserved for your sole use by using the salloc command. As an example, you can reserve exclusive access to one compute node for an hour on a chosen partition:

salloc --account=my-account --partition=this-partition --time=1:00:00 -N1 --exclusive

Once resources are allocated, Slurm responds with a job ID. From here, you can execute commands on the compute resources using `srun`. srun is a command line interface for executing "job steps" in Slurm. You can specify how much of the allocated compute resources to use for each job step. For example, the srun command below launches ./my-application as a job step using 4 tasks.

srun -n4 ./my-application
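
Within a single allocation, you can also run several job steps back-to-back. The sketch below assumes hypothetical executables ./pre-process, ./my-application, and ./post-process in your working directory; each srun call launches a separate job step against the same allocation.

# Pre-process input data on a single task
srun -n1 ./pre-process
# Run the main application across 4 tasks
srun -n4 ./my-application
# Post-process results on a single task
srun -n1 ./post-process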

It is highly recommended that you familiarize yourself with Slurm’s salloc and srun command line tools so that you can make efficient use of your compute resources.

To release your allocation before the requested wall-time expires, you can use scancel:

scancel <job-id>

After cancelling your job, or after the wall-clock limit is exceeded, Slurm will automatically delete compute nodes for you.

Interactive Shell Workflow (with Graphics Forwarding)

If your workflow requires graphics forwarding from compute resources, you can allocate resources as before using salloc, e.g.,

salloc --account=my-account --partition=this-partition --time=1:00:00 -N1 --exclusive

Once resources are allocated, you can launch a shell on the compute resources with X11 forwarding enabled.

srun -N1 --pty --x11 /bin/bash

Once you have finished your work, exit the shell and release your resources:

exit
scancel <job-id>
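
If you only need to run a single graphical application, you can also launch it directly as a job step with X11 forwarding instead of opening a shell. The example below assumes xterm is available on the compute image; substitute your own graphical application as needed.

srun -N1 --x11 xterm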

Run MPI Applications

Fluid-Slurm-GCP comes with OpenMPI preinstalled and integrated with the Slurm job scheduler. OpenMPI can be brought into your path by using module load:

module load gcc/10.2.0 openmpi/4.0.2

Once loaded into your path, you can build your application against OpenMPI using the mpif90, mpic++, or mpicc compiler wrappers.
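
For example, a C MPI source file (here a hypothetical my_application.c) can be compiled with the mpicc wrapper:

mpicc -O2 -o my-application my_application.c

Once compiled, you can run your application using srun, either through interactive jobs or batch jobs. Below is a simple example of running an MPI application with 8 MPI ranks using srun.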

#!/bin/bash
#SBATCH --account=my-slurm-account
#SBATCH --partition=this-partition
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=8

module load gcc/10.2.0 openmpi/4.0.2
srun -n8 ./my-application

Task Affinity

Understanding Slurm Resource flags

Slurm allows you to request specific amounts of vCPU, memory, and GPUs when submitting a job. Commonly used flags in Slurm batch headers, along with their purposes, are listed below; an example header that combines them follows the list.

  • --ntasks : The number of tasks that need to be executed. Typically, for MPI jobs, this is equivalent to the number of MPI ranks.

  • --cpus-per-task : The number of logical cpus (vCPUs) to assign to each task. On modern hardware, a physical core can support two hyperthreads (logical CPUs). Some codes benefit from running on all hyperthreads, while others do not.

  • --mem-per-cpu : The amount of memory needed for each vCPU used in launching jobs. By default, this value is set to the total VM memory divided by the number of vCPUs.

  • --gres : A flag for specifying any generic resources to assign to the job. Currently, GPUs are made available as generic resources.
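
As a sketch of how these flags combine, the batch header below requests 4 tasks with 2 vCPUs each, 2 GB of memory per vCPU, and 1 GPU per node; the account and partition names are placeholders, and the values should be adjusted to the machine types in your partition.

#!/bin/bash
#SBATCH --account=my-slurm-account
#SBATCH --partition=this-partition
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2G
#SBATCH --gres=gpu:1

./my-application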

When a batch job is launched and the requested resources are allocated, you can specify how individual tasks are mapped onto the allocated resources. This mapping of tasks to hardware is called the task affinity. Task affinity can be controlled with mpirun/mpiexec or Slurm's srun. Proper task affinity is necessary to obtain optimal performance from your application.

Fluid Numerics offers support services to help you profile, benchmark, and optimize your HPC and HTC applications. Request support for more information.

Specifying task affinity with mpirun

OpenMPI comes pre-installed on Fluid-Slurm-GCP systems and provides the following flags for controlling task affinity:

  • --report-bindings : Shows how MPI ranks are bound to resources

  • --bind-to : Specify which hardware resource to bind each MPI rank to. Options are hwthread, core, socket, node. Binding to a specific component of hardware prevents the process from leaving the assigned hardware component.

  • --map-by : Specify how MPI ranks are mapped to hardware. Options are hwthread, core, socket, node.

  • --np : Specify the number of MPI ranks to launch your application with

Map MPI ranks to physical cores

This Slurm batch file template can be used to allocate two logical CPUs per task. The mpirun flags bind each MPI rank to a hardware core and map ranks sequentially onto the physical cores. In this scenario, each MPI rank executes on its own physical core, but can switch between the two available hyperthreads on that core.

#SBATCH --partition=c2-standard-60
#SBATCH --ntasks=30
#SBATCH --cpus-per-task=2

module load gcc/10.2.0 openmpi/4.0.2

mpirun -np 30 --map-by core --bind-to core --report-bindings ./my-application

Map MPI ranks to hardware threads

This Slurm batch file template can be used to allocate one logical CPU per task. The mpirun flags map MPI ranks sequentially onto the physical cores but bind each rank to a hardware thread. In this scenario, each MPI rank is pinned to a single hardware thread and cannot migrate to the other hyperthread on its core.

#SBATCH --partition=c2-standard-60
#SBATCH --ntasks=30
#SBATCH --cpus-per-task=1

module load gcc/10.2.0 openmpi/4.0.2

mpirun -np 30 --map-by core --bind-to hwthread --report-bindings ./my-application

Specifying task affinity with srun

Applications built with OpenMPI expect that each MPI rank is assigned to a slot on a compute node. When working with Slurm, the number of slots is equivalent to the number of tasks (--ntasks). Additionally, you can control the task affinity, which means you can specify (in as much detail as you like) how each MPI rank is mapped to the underlying compute hardware.

The easiest place to get started, if you are unsure of an ideal mapping, is to use the --hint flag with srun. This flag allows you to suggest to srun whether your application is compute bound (--hint=compute_bound), memory bound (--hint=memory_bound), or communication intensive (--hint=multithread). Additionally, you can add the --cpu-bind=verbose flag to report the task affinity back to STDOUT.

If you are unsure of your application's performance bottlenecks, you are encouraged to profile your application.

For compute bound applications, --hint=compute_bound will map tasks to all available cores.

srun -n8 --hint=compute_bound --cpu-bind=verbose ./my-application

For memory bound applications, --hint=memory_bound will map tasks to one core for each socket, giving the highest possible memory bandwidth for each task.

srun -n8 --hint=memory_bound --cpu-bind=verbose ./my-application

For communication intensive applications, --hint=multithread will map tasks to hardware threads (hyperthreads/virtual CPUs)

srun -n8 --hint=multithread --cpu-bind=verbose ./my-application

In addition to hints, you can use the following high-level flags; a short example follows the list.

  • --sockets-per-node | Specify the number of sockets to allocate per VM

  • --cores-per-socket | Specify the number of cores per socket to allocate

  • --threads-per-core | Specify the number of threads per core to allocate
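
As a sketch, the srun command below requests a layout of 1 socket per node, 4 cores per socket, and 1 thread per core for an 8-task run; these values are illustrative and should match the machine type backing your partition.

srun -n8 --sockets-per-node=1 --cores-per-socket=4 --threads-per-core=1 --cpu-bind=verbose ./my-application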

For more details on task affinity, simply run srun --help or visit SchedMD's documentation on multi-core/multi-threaded architecture support.

Run GPU Accelerated Applications

Fluid-Slurm-GCP comes with the CUDA toolkit (/usr/local/cuda) and ROCm (/opt/rocm) preinstalled and configured to be in your default path. Currently, Fluid-Slurm-GCP and Google Cloud Platform only offer Nvidia GPUs.

Provided some of your cluster's compute partitions have GPUs attached, you can use the --gres=gpu:N flag, where N is the number of GPUs needed per node.

#!/bin/bash
#SBATCH --account=my-slurm-account
#SBATCH --partition=this-partition
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1

./my-application
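
As a quick sanity check, you can list the GPUs visible to the job by adding a call to nvidia-smi (installed alongside the CUDA toolkit on Nvidia GPU images) before your application in the script above:

# Show the GPUs allocated to this job
nvidia-smi
./my-application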

Multi-GPU with 1 MPI rank per GPU

When submitting jobs to run on multiple GPUs, you may find it necessary to bind MPI ranks to GPUs on the same node. In the example below, we assume that you have 8 GPUs per node and want to run 16 MPI tasks, with 1 MPI rank per GPU.

#!/bin/bash
#SBATCH --account=my-slurm-account
#SBATCH --partition=this-partition
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

srun -n16 --accel-bind=g ./my-application

Monitoring Jobs and Resources

Checking Slurm job status

Slurm's squeue command can be used to keep track of jobs that have been submitted to the job queue.

squeue

You can use optional flags, such as --user and --partition to filter results based on username or compute partition associated with each job.

squeue --user=USERNAME --partition=PARTITION

Slurm jobs have a status code associated with them that changes over the lifespan of the job. Common status codes are listed below; an example of filtering by state follows the list.

  • CF | The job is in a configuring state. Typically this state is seen when autoscaling compute nodes are being provisioned to execute work.

  • PD | The job is in a pending state.

  • R | The job is in a running state.

  • CG | The job is in a completing state and the associated compute resources are being cleaned up.

  • (Resources) | Shown as the reason for a pending job when there are insufficient resources available to schedule it at the moment.
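
If you only want to see jobs in particular states, squeue accepts a comma-separated list of states with the --states (-t) flag. For example, to list only pending and configuring jobs:

squeue --states=PD,CF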

Checking Slurm compute node status

Slurm's sinfo command can be used to keep track of the compute nodes and partitions available for executing workloads.

sinfo

Compute nodes have a status code associated with them that change during the lifespan of each node. A few common state codes are shown below. A more detailed list can be found in SchedMD's documentation.

  • idle | The compute node is in an idle state and can receive work.

  • down | The compute node is in a down state and may need to be drained and resumed before it can receive work. Downed nodes are also symptomatic of other issues on your cluster, such as insufficient quota or improperly configured machine blocks.

  • mixed | A portion of the compute node's resources has been allocated, but additional resources are still available for work.

  • allocated | The compute node is fully allocated.

Additionally, each state code can have a modifier with the following meanings; an example of filtering nodes by state follows the list.

  • ~ | The compute node is in a "cloud" state and will need to be provisioned before receiving work

  • # | The compute node is currently being provisioned (powering up)

  • % | The compute node is currently being deleted (powering down)
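
To list only nodes in a given state, sinfo accepts a state list with the --states (-t) flag. For example, to show only idle nodes:

sinfo --states=idle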