Slurm Partitions Management

Overview

When operating an auto-scaling HPC cluster on the cloud, you have access to all of the various arrangements of virtual machines that the cloud provider has to offer. On Google Cloud, you can choose:

  • CPU platform/machine type (n1, n2, n2d, c2, e2)

  • Number of vCPU/VM

  • Amount of memory/VM

  • GPU Type and GPU Count/per VM

  • Preemptibility

  • VM image

This gives you numerous options when customizing a heterogeneous Cloud-HPC cluster for your organization. To facilitate customization of your cluster's compute nodes at any time, Fluid-Slurm-GCP comes with a command line tool called cluster-services and a dictionary schema to describe your cluster called a cluster-config.

Understanding Partitions and Machine Blocks

Cluster-services and the cluster-config dictionary categorize your Cloud-HPC compute nodes into "partitions" and "machine-blocks".

Machine Block

A machine block is a homogeneous group of Google Compute Engine (GCE) instances. VMs in a machine block share the following attributes:

  • Machine type | The machine type on Google Cloud specifies the CPU platform, the number of vCPU/VM, and the amount of memory/VM.

  • Disk Type | Compute nodes in your cluster all have a boot disk. The disk type for that boot disk is either pd-standard or pd-ssd.

  • Disk Size (GB) | The disk size of each compute node's boot disk.

  • VM Image | By default, all of the compute nodes in your cluster use the fluid-slurm-gcp compute image. You can also build custom images on top of the fluid-slurm-gcp images to have operating system customizations, additional applications installed, or even pre-loaded datasets.

  • GPU Type & GPU Count | Your compute nodes can have up to 8 GPUs attached to them, provided they are deployed in a zone that offers GPUs.

  • Preemptibility | Preemptible GCE instances offer substantial cost savings, but live for at most 24 hours and can be preempted at any time during that window. If your applications are fault tolerant and can recover from preemption, this is an effective cost-saving option to consider.

  • Local SSDs | You can add up to eight 375GB local SSDs to each VM in a machine block. When more than one local SSD is added, they are fused in a RAID0 configuration and mounted to a directory of your choosing (the default is /scratch). Local SSDs provide high-throughput, low-latency scratch storage for I/O-intensive workloads.

  • Zone & Subnetwork | All machines within a machine block are deployed in the same GCP Zone and on the same VPC subnetwork.

For each machine block, you can set a maximum allowable number of nodes that can be live at any time (max_node_count). However, it's important to make sure this value is consistent with your GCP quota.
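As an illustrative sketch, a single machine block might be described in the cluster-config with entries like the following. The field names shown here are assumptions for illustration; validate your actual config against the schema at /apps/cls/etc/cluster-config.schema.json.

```yaml
# Hypothetical machine-block entry; field names are illustrative and
# should be checked against /apps/cls/etc/cluster-config.schema.json.
- name: gpu-nodes
  machine_type: n1-standard-8        # CPU platform, vCPU/VM, and memory/VM
  disk_type: pd-ssd                  # pd-standard or pd-ssd
  disk_size_gb: 100                  # boot disk size
  gpu_type: nvidia-tesla-v100
  gpu_count: 2                       # up to 8, in zones that offer GPUs
  preemptible_bursting: true         # preemptible VMs live at most 24 hours
  local_ssd_mount_directory: /scratch
  max_node_count: 10                 # keep consistent with your GCP quota
  zone: us-west1-b                   # all VMs in the block share this zone
```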

Partitions

Partitions (synonymous with Slurm partitions) consist of an array of machine-blocks that share the following attributes:

  • GCP Project | A Google Cloud Project is used to group and manage cloud resources, billing, and permissions. You can configure multiple partitions in your cluster, each with its own GCP project. This provides an easy method for dividing up your monthly cloud bills across multiple cost-centers.

  • VM Labels | All resources on Google Cloud can have labels associated with them. This provides a good method for grouping resources for working with Google Cloud Operations (formerly Stackdriver) Monitoring and Logging.

  • Wall Clock Limit | Since each partition is a Slurm partition, you are able to configure the wall clock limit for each job submitted to the partition.

On Fluid-Slurm-GCP clusters, you are able to have multiple compute partitions, with each partition having multiple machine types. This level of composability allows you to meet various business and technical needs.
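To make the partition-level attributes concrete, a partition entry wrapping a minimal machine block might look like the sketch below. As with the machine-block fields, these names are illustrative assumptions; validate against /apps/cls/etc/cluster-config.schema.json on your cluster.

```yaml
# Hypothetical partition entry; field names are illustrative and
# should be checked against /apps/cls/etc/cluster-config.schema.json.
partitions:
- name: research
  project: my-research-project       # GCP project billed for this partition
  labels:
    cost-center: research            # VM labels for Monitoring/Logging
  max_time: "8:00:00"                # Slurm wall clock limit per job
  machines:                          # array of machine-blocks
  - name: cpu-nodes
    machine_type: n2-standard-16
    max_node_count: 20
    zone: us-central1-a
```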

Using cluster-services

When configuring partitions on your cluster, we recommend that you follow this four-step workflow:

  1. Create a cluster-config file containing your cluster's current configuration.

     cluster-services list all > config.yaml

  2. Edit the cluster-config file.

  3. Preview the changes. During this stage, your modified config file is validated against the cluster-config schema (/apps/cls/etc/cluster-config.schema.json) and the changes that will take place are reported to standard output.

     cluster-services update partitions --config=config.yaml --preview

  4. Apply the changes.

     cluster-services update partitions --config=config.yaml

Examples

Follow along with the Codelabs below to get hands-on experience configuring your partitions.