Slurm Partitions Management
When operating an auto-scaling HPC cluster in the cloud, you have access to all of the virtual machine configurations that the cloud provider offers. On Google Cloud, you can choose:
- CPU platform/machine type (e.g., n1, n2, n2d, c2, e2)
- Number of vCPUs per VM
- Amount of memory per VM
- GPU type and GPU count per VM
This gives you numerous options when customizing a heterogeneous Cloud-HPC cluster for your organization. To facilitate customization of your cluster's compute nodes at any time, Fluid-Slurm-GCP comes with a command line tool called cluster-services and a dictionary schema to describe your cluster called a cluster-config.
Understanding Partitions and Machine Blocks
Cluster-services and the cluster-config dictionary categorize your Cloud-HPC compute nodes into "partitions" and "machine-blocks".
A machine block is a homogeneous group of Google Compute Engine (GCE) instances. VMs in a machine block share the following attributes:
Machine type | The machine type on Google Cloud specifies the CPU platform, the number of vCPUs per VM, and the amount of memory per VM.
Disk Type | Compute nodes in your cluster all have a boot disk. The disk type for that boot disk is either pd-standard or pd-ssd.
Disk Size (GB) | The disk size of each compute node's boot disk.
VM Image | By default, all of the compute nodes in your cluster use the fluid-slurm-gcp compute image. You can also build custom images on top of the fluid-slurm-gcp images to have operating system customizations, additional applications installed, or even pre-loaded datasets.
GPU Type & GPU Count | Your compute nodes can have up to 8 GPUs attached to them, provided they are deployed in a zone that offers GPUs.
Preemptibility | Preemptible GCE instances offer substantial cost savings, but live for at most 24 hours. Additionally, these instances can be preempted at any time during that 24-hour window. If your applications are fault tolerant and can recover from preemption, this is an effective cost-saving option to consider.
Local SSDs | You can add up to eight 375GB local SSDs to each VM in a machine block. When more than one local SSD is added, they are fused in a RAID0 configuration and mounted to a directory of your choosing (the default is /scratch). Local SSDs provide high-throughput, low-latency scratch storage for I/O-intensive workloads.
For each machine block, you can set a maximum allowable number of nodes that can be live at any time (max_node_count). However, it's important to make sure this value is consistent with your GCP quota.
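As a sketch, a single machine block in the cluster-config might look like the following. The field names and values here are illustrative only; validate your own file against the cluster-config schema (/apps/cls/etc/cluster-config.schema.json) for the authoritative field names.

```yaml
# Illustrative machine-block entry (field names are indicative only --
# check the cluster-config schema on your cluster).
- name: c2-compute
  machine_type: c2-standard-60        # sets CPU platform, vCPUs/VM, and memory/VM
  disk_type: pd-ssd                   # boot disk type: pd-standard or pd-ssd
  disk_size_gb: 100                   # boot disk size in GB
  image: projects/<project>/global/images/<image>   # placeholder: your compute VM image
  gpu_type: null                      # e.g. a GPU model, in a zone that offers GPUs
  gpu_count: 0                        # up to 8 GPUs per VM
  preemptible_bursting: false         # preemptible VMs live at most 24 hours
  local_ssd_mount_directory: /scratch # mount point when local SSDs are attached
  max_node_count: 10                  # keep consistent with your GCP quota
```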
Partitions (synonymous with Slurm partitions) consist of an array of machine blocks that share the following attributes:
GCP Project | A Google Cloud Project is used to group and manage cloud resources, billing, and permissions. You can configure multiple partitions in your cluster, each with their own GCP project. This provides an easy method for dividing up your monthly cloud bills across multiple cost-centers.
VM Labels | All resources on Google Cloud can have labels associated with them. This provides a good method for grouping resources for working with Google Cloud Operations (formerly Stackdriver) Monitoring and Logging.
Wall Clock Limit | Since each partition is a Slurm partition, you are able to configure the wall clock limit for each job submitted to the partition.
On Fluid-Slurm-GCP clusters, you are able to have multiple compute partitions, with each partition having multiple machine types. This level of composability allows you to meet various business and technical needs.
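Putting the two concepts together, a partitions section with two partitions, each billed to its own GCP project, might be sketched as below. Names, projects, and field spellings are hypothetical; consult the cluster-config schema for the exact structure.

```yaml
# Illustrative partitions array (field names are indicative only).
partitions:
- name: cpu-batch
  project: research-group-a      # hypothetical cost-center project for this partition
  max_time: "8:00:00"            # wall clock limit per job
  labels:
    team: group-a                # VM labels for Cloud Operations monitoring/logging
  machines:
  - name: n2-nodes
    machine_type: n2-standard-8
    max_node_count: 20
- name: gpu-batch
  project: research-group-b      # a second partition, billed to a different project
  max_time: "24:00:00"
  labels:
    team: group-b
  machines:
  - name: gpu-nodes
    machine_type: n1-standard-8
    gpu_count: 4
    max_node_count: 4
```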
When configuring partitions on your cluster, we recommend that you follow this four-step workflow:
Create a cluster-config file containing your cluster's current configuration
Edit the cluster-config file
Preview the changes. During this stage, your modified config file is validated against the cluster-config schema (/apps/cls/etc/cluster-config.schema.json) and the expected changes to take place are reported to standard output.
Apply the changes
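The four steps above map onto cluster-services invocations roughly as follows. The subcommands and flags shown are a sketch and may differ across releases; run cluster-services --help on your cluster to confirm them.

```shell
# 1. Write the cluster's current configuration to a file
cluster-services list all > cluster-config.yaml

# 2. Edit the partitions section of the cluster-config file
vim cluster-config.yaml

# 3. Preview: validates the file against the cluster-config schema
#    (/apps/cls/etc/cluster-config.schema.json) and prints the expected changes
cluster-services update partitions --config=cluster-config.yaml --preview

# 4. Apply the changes
cluster-services update partitions --config=cluster-config.yaml
```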
Follow along with the Codelabs below to get hands-on experience configuring your partitions.