Operating HPC clusters on Google Cloud Platform opens up many new possibilities and capabilities for your organization. However, there is no free lunch: with these new possibilities comes more for system administrators and engineers to control. Cluster-services offers an easy-to-use command line interface for modifying compute partitions, Slurm accounting, and network-mounted storage.
Through Fluid Numerics®' experience in developing custom cloud-HPC solutions, we've uncovered typical operation and maintenance tasks and encapsulated them in cluster-services. The cluster-services command line interface is used to modify compute partitions and to add or remove external filesystems, such as Lustre. Performing these types of operations on slurm-gcp manually requires multiple steps to ensure that the desired changes are achieved. Rather than modifying configuration files or re-deploying, use cluster-services to customize your slurm-gcp deployment.
The Fluid Numerics® Slurm-GCP marketplace deployment comes with a command line interface, called cluster-services, for managing your resources after deployment. The cluster-services CLI allows you to manage your cluster's partitions and available machines, Slurm accounting, and external filesystem mounts.
The cluster-services CLI has been updated with the Version 2.3.0 release of fluid-slurm-gcp. Updates include:
Updated help documentation
The default_partition item has been added to the cluster-config schema which allows users to specify a default Slurm partition.
A --preview flag on all update commands lets you preview the changes to your cluster before actually making them.
The --name flag has been removed from cluster-services add user. Individual users can be added to the default Slurm account using cluster-services add user <name>
Users can now obtain template cluster-config blocks using cluster-services sample all/mounts/partitions/slurm_accounts
User provided cluster-configs are now validated against /apps/cls/etc/cluster-config.schema.json
Added cluster-services logging to /apps/cls/log/cluster-services.log
Fixed incorrect core count bug with the partitions.machines.enable-hyperthreading flag
Removed add/remove mounts/partitions options; mounts and partitions are now updated by using update all, update mounts, and/or update partitions calls.
The add/remove user calls only add a user to, or remove a user from, the default Slurm account. These calls are strictly convenience calls.
The cluster-config schema now specifies compute, controller, and login images in compute_image, controller_image, and login_image rather than in the partitions.machines, controller, and login list-objects.
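As a quick illustration of the commands named in the list above, here is a minimal sketch. The subcommands and the log path come from the release notes; the `run` helper and the user name `alice` are hypothetical and exist only for this example. The helper prints each command and executes it only when `DRY_RUN=0`, so the snippet is safe to try away from a cluster.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print a command, and run it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Print a template cluster-config block for partitions.
# Other documented targets: all, mounts, slurm_accounts.
run cluster-services sample partitions

# Add a user to the default Slurm account ("alice" is a placeholder name).
run cluster-services add user alice

# Show recent CLI activity from the log file added in 2.3.0, if it is readable.
LOG=/apps/cls/log/cluster-services.log
if [ -r "$LOG" ]; then
  tail -n 20 "$LOG"
fi
```

On a login or controller node, running the same snippet with `DRY_RUN=0` would execute the commands for real.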
To customize your cluster, the following workflow is recommended:
1. Create a configuration file from the current configuration
2. Edit your config.yaml and validate and preview the changes. Note that all cluster-services update commands validate your config file against the cluster-config schema.
3. If the configuration file validates and you approve the changes that are previewed, apply the changes:
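The three steps can be sketched as a small shell script. Treat this as an illustration under stated assumptions rather than the documented invocation: the `update all` call and the `--preview` flag appear in the release notes above, but the `list all` subcommand and the `--config=` flag are assumptions, so check `cluster-services --help` on your cluster for the exact syntax. The `run` helper is hypothetical and only prints commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the list -> preview -> apply workflow. Dry run by default.
CONFIG=${CONFIG:-config.yaml}
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print a command, and run it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# 1. Save the current configuration to a file.
#    ASSUMPTION: "list all" prints the active cluster-config.
if [ "$DRY_RUN" = "0" ]; then
  cluster-services list all > "$CONFIG"
else
  printf '+ cluster-services list all > %s\n' "$CONFIG"
fi

# 2. Edit $CONFIG by hand, then validate and preview the changes.
#    ASSUMPTION: the config file is passed with --config=<file>.
run cluster-services update all --config="$CONFIG" --preview || exit 1

# 3. Apply only after the preview has been reviewed and approved.
if [ "$DRY_RUN" = "0" ]; then
  printf 'Apply these changes? [y/N] '
  read -r answer
  [ "$answer" = "y" ] || exit 0
fi
run cluster-services update all --config="$CONFIG"
```

Stopping when the preview command fails keeps an invalid config.yaml from ever reaching the apply step, and the confirmation prompt mirrors the approval called for in step 3.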