Cluster Services

A quick reference for the cluster-services command line interface that assists maintenance and operations.

Develop an understanding of the schema that describes the Fluid-Slurm-GCP cluster.

About

Operating HPC clusters in Google Cloud Platform opens up many new possibilities and capabilities for your organization. However, there is no free lunch: with these new possibilities comes more for system administrators and engineers to control. Cluster-services offers an easy-to-use command line interface to modify compute partitions, Slurm accounting, and network-mounted storage.

Through Fluid Numerics' experience in developing custom cloud-HPC solutions, we've uncovered typical operation and maintenance tasks and encapsulated them in cluster-services. The cluster-services command line interface is used to modify compute partitions and to add or remove external filesystems, such as Lustre. Performing these types of operations on slurm-gcp manually requires multiple steps to ensure that the desired changes are achieved. Rather than modifying configuration files or re-deploying, use cluster-services to customize your slurm-gcp deployment.

The Fluid Numerics Slurm-GCP marketplace deployment comes with a command line interface, called cluster-services, for managing your resources after deployment. The cluster-services CLI allows you to manage your cluster's partitions and available machines, Slurm accounting, and external filesystem mounts.

Updates

The cluster-services CLI has been updated with the Version 2.3.0 release of fluid-slurm-gcp. Updates include:

  • Updated help documentation
  • The default_partition item has been added to the cluster-config schema which allows users to specify a default Slurm partition.
  • The --preview flag, available on all update commands, lets you preview the changes to your cluster before they are applied.
  • cluster-services add user --name flag removed. Individual users can be added to the default slurm account using cluster-services add user <name>
  • Users can now obtain template cluster-config blocks using cluster-services sample all/mounts/partitions/slurm_accounts
  • User provided cluster-configs are now validated against /apps/cls/etc/cluster-config.schema.json
  • Added cluster-services logging to /apps/cls/log/cluster-services.log
  • Fixed incorrect core count bug with the partitions[].machines[].enable-hyperthreading flag
  • Removed add/remove mounts/partitions options; mounts and partitions are now updated by using update all, update mounts, and/or update partitions calls.
  • The add/remove user calls only add a user to, or remove a user from, the default Slurm account. These calls are strictly convenience calls.
  • The cluster-config schema now specifies compute, controller, and login images in compute_image, controller_image, and login_image rather than in the partitions.machines, controller, and login list-objects.
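
As a rough illustration of the sample command mentioned in the list above, the following sketch wraps it in a small shell function that rejects section names other than the four listed (all, mounts, partitions, slurm_accounts). The function name cls_sample and the CLS_BIN variable are hypothetical helpers introduced here for illustration; they are not part of cluster-services.

```shell
# Hypothetical wrapper around `cluster-services sample <section>`.
# CLS_BIN is an assumed variable (not part of cluster-services) that lets
# the command be swapped for a stub, e.g. when testing off-cluster.
CLS_BIN="${CLS_BIN:-cluster-services}"

cls_sample() {
  section="$1"
  case "$section" in
    all|mounts|partitions|slurm_accounts)
      # Print the template cluster-config block for the requested section.
      "$CLS_BIN" sample "$section"
      ;;
    *)
      # Refuse anything that is not one of the documented section names.
      echo "unknown section: $section (expected all, mounts, partitions, or slurm_accounts)" >&2
      return 1
      ;;
  esac
}
```

For instance, cls_sample partitions > partitions.yaml would save a partitions template to a file that can then be edited by hand.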

Usage

To customize your cluster, the following workflow is recommended:

1. Create a configuration file from the current configuration

$ sudo su
[root]# cluster-services list all > config.yaml

2. Edit your config.yaml, then validate and preview the changes. Note that all cluster-services update commands validate your config file against the cluster-config schema.

[root]# cluster-services update all --preview --config=config.yaml

3. If the configuration file validates and you approve the previewed changes, apply them:

[root]# cluster-services update all --config=config.yaml
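
The three steps above can be sketched as a few shell functions, which may be handy if you repeat the workflow often. This is only a sketch under stated assumptions: the function names and the CLS_BIN variable are hypothetical, and cls_apply assumes that the preview command exits with a non-zero status when validation fails, which the documentation above does not confirm. As in the steps above, the commands are expected to be run as root.

```shell
# Hypothetical helpers for the list -> preview -> apply workflow.
# CLS_BIN is an assumed variable (not part of cluster-services) that lets
# the command be swapped for a stub, e.g. when testing off-cluster.
CLS_BIN="${CLS_BIN:-cluster-services}"

# Step 1: write the current cluster configuration to the given file.
cls_export() {
  "$CLS_BIN" list all > "$1"
}

# Step 2: validate the edited file and preview the resulting changes.
cls_preview() {
  "$CLS_BIN" update all --preview --config="$1"
}

# Step 3: apply the changes, but only if the preview step succeeded.
# ASSUMPTION: a failed validation makes the preview exit non-zero.
cls_apply() {
  cls_preview "$1" && "$CLS_BIN" update all --config="$1"
}
```

A typical session would be cls_export config.yaml, then editing config.yaml, then cls_preview config.yaml to inspect the changes, and finally cls_apply config.yaml. Note that cls_apply does not pause for confirmation, so review the preview output yourself before calling it.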

Tutorials

Learn how to build and operate your fluid-slurm-gcp cluster