Multi-Project Slurm-GCP for Universities and Large Organizations

Overview

The fluid-slurm-gcp deployment is a scalable, image-based, multi-project capable deployment of slurm-gcp that integrates naturally with billing platforms like Orbitera. It allows organizations to operate a shared HPC cluster, or many HPC clusters, in Google Cloud Platform while maintaining the ability to bill groups for their usage of the resources. This is useful for organizations that want to break up cloud expenses across many research grants and/or multiple departments.


This documentation covers the GCP infrastructure that can be maintained with Fluid Numerics’ fluid-slurm-gcp. We’ll begin with a description of each of the components that make up a complete multi-project fluid-slurm-gcp deployment. To illustrate how fluid-slurm-gcp can be used operationally, we will walk through an example organization with system administrators and research teams. System administrators are the individuals who manage and operate fluid-slurm-gcp resources. Research teams are groups of individuals who use the available fluid-slurm-gcp resources to execute HPC applications.



Multi-Project Fluid-Slurm-GCP Components

Fluid Numerics’ multi-project fluid-slurm-gcp consists of operating system images and tools for managing your infrastructure-as-code. A single fluid-slurm-gcp HPC cluster can be deployed through the GCP marketplace; multi-project capabilities through the marketplace are expected to be available in February 2020. Fluid Numerics also provides Terraform modules to manage users, user groups, project hierarchy, IAM policies, networking resources and firewall rules, and fluid-slurm-gcp HPC clusters.

If you are interested in working with these Terraform modules and multi-project fluid-slurm-gcp prior to the marketplace upgrade, reach out to sales@fluidnumerics.com.

When we speak of components, we are specifically referring to the components maintained by infrastructure-as-code to make a multi-project fluid-slurm-gcp deployment possible. These components are

  1. GSuite/Cloud Identity - Directory Management of users and user groups
  2. Resource Collections - Project Hierarchy and IAM policies
  3. Network Resources - Shared VPC Networks, Subnetworks, and Firewall Rules
  4. Fluid-Slurm-GCP Clusters - Compute and disk resources in addition to Slurm configurations.

Fluid-slurm-gcp images are used in the GCP marketplace deployment and can be referenced in the Terraform scripts when specifying the attributes for the clusters.


GSuite and Cloud Identity

GSuite and Cloud Identity are services from Google that provide access to Google Directory. Google Directory is used to define user accounts and group accounts within your organization. In the context of GCP Identity and Access Management (IAM), groups help keep policy management clean and simple: users are grouped by job function and permissions are applied per group.

For example, consider a university department with 50 employees filling the roles of directors, office administration staff, IT system administrators, researchers, and students. Rather than applying permissions to all 50 employees individually, each employee can be placed into a group defined by their organizational role in GSuite/Cloud Identity. From here, permissions can be applied to the groups (director, office administrator, etc.), drastically reducing the number of IAM policies that need to be specified.
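As a brief sketch of this pattern, the Terraform snippet below binds a role to a whole group rather than to individual users. The project ID, group address, and role are illustrative placeholders, not values from a real deployment.

```
# Grant a role to an entire GSuite/Cloud Identity group instead of binding
# each employee individually. Project ID, group email, and role are
# placeholders for illustration.
resource "google_project_iam_member" "researchers_compute_viewer" {
  project = "example-hpc-project"
  role    = "roles/compute.viewer"
  member  = "group:researchers@example-university.edu"
}
```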


Resource Collections

GCP allows you to organize your cloud projects in a way that resembles file system hierarchies. At the top of the hierarchy is your GCP organization node. Underneath the organization node, you can create folders and projects. You can create folders and projects underneath folders as well. In GCP, permissions are inherited from the organization node and parent folders.

For example, someone granted the Billing Administrator role at the organization node will have the Billing Administrator role on all projects and folders within the organization. Similarly, permissions applied at a folder are inherited by all subfolders and projects underneath that folder.

A resource collection consists of a folder, IAM policies applied to the folder, and the GCP projects that sit directly underneath the folder. The structure of a resource collection is driven by the assumption that GSuite groups work across multiple GCP projects. For fluid-slurm-gcp, this structure is required for teams and organizations that want to segregate billing charges.
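As a minimal sketch of this structure using the standard Terraform google provider (rather than Fluid Numerics’ own modules), a resource collection might look like the following. The organization ID, billing account, and names are placeholders.

```
# One folder, one project underneath it, and an IAM policy applied at the
# folder level. Permissions granted on the folder are inherited by the
# project. All identifiers below are illustrative.
resource "google_folder" "research_team" {
  display_name = "example-research-team"
  parent       = "organizations/123456789012"
}

resource "google_project" "team_project" {
  name            = "example-team-project"
  project_id      = "example-team-project"
  folder_id       = google_folder.research_team.name
  billing_account = "AAAAAA-BBBBBB-CCCCCC"
}

resource "google_folder_iam_member" "team_viewer" {
  folder = google_folder.research_team.name
  role   = "roles/viewer"
  member = "group:example-research-team@example-university.edu"
}
```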


Network Resources

Compute resources on GCP communicate with other resources and services through virtual private cloud (VPC) networks. Shared VPC networks are network resources in GCP that can be defined and controlled in one project, but used by resources in other projects. Shared VPC allows system administrators to control traffic into and out of multiple slurm-gcp clusters within one centralized project, rather than having to maintain many VPC networks.

Network resources for multi-project deployments of fluid-slurm-gcp consist of a project marked as the shared VPC host project, projects marked as the shared VPC service projects, subnetworks for each fluid-slurm-gcp cluster, and firewall rules.
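A sketch of how the host and service project relationship can be expressed with the standard google provider is shown below; the project IDs are placeholders.

```
# Mark a central project as the shared VPC host and attach a cluster project
# as a service project that consumes the host's network. IDs are placeholders.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "example-vpc-host-project"
}

resource "google_compute_shared_vpc_service_project" "cluster" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "example-cluster-project"
}
```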


Fluid-Slurm-GCP Clusters

The fluid-slurm-gcp cluster provides a set of login nodes, a controller node, and static and ephemeral compute nodes. The login nodes serve as the point-of-access for cluster users. The controller hosts the Slurm job scheduler and the Slurm database. Compute nodes are used for executing workloads scheduled by the Slurm job scheduler. Ephemeral compute nodes are created in GCP when needed and are deleted after remaining idle for 5 minutes, by default.

Fluid Numerics provides public images for fluid-slurm-gcp clusters that incur a usage fee based on CPU and GPU usage of the login, controller, and compute nodes. The usage fee pricing schedule is:

  • $0.01/vCPU/hour
  • $0.09/GPU/hour
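As an illustrative example, a job that keeps 32 vCPUs busy for 10 hours accrues 32 × 10 × $0.01 = $3.20 in usage fees, in addition to the standard Compute Engine charges for the underlying instances.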

Configuring a cluster with all ephemeral compute nodes is recommended when usage is expected to be highly variable and cost savings are a priority. This is because you are only charged while the compute instances are active within your GCP projects. By letting Slurm automatically delete idle instances, you can realize significant cost savings over using static compute instances.


Multi-Project Fluid-Slurm-GCP Example

With the components of multi-project fluid-slurm-gcp introduced, we will now walk through an example deployment. We start by introducing a fictitious example organization along with accounting/business requirements and follow with a configuration description for each of the fluid-slurm-gcp components. We will conclude this example with a demonstration of how these resources integrate easily with Orbitera.

Case Study Background

The Department of Applied Mathematics at Generic University is approaching its next procurement cycle to replace an aging, moderately sized on-premises HPC cluster. They are considering cloud computing resources to host an HPC cluster, provided research teams can receive individual bills for the cloud resources they have used. Additionally, they would like to provide billing visibility to principal investigators (PIs) so they can make sure they are staying within budget.

A small team of system administrators and two research teams within the applied mathematics department have decided to work together to assess the viability of cloud-HPC resources on GCP as part of a pilot project.

The roles for each group of individuals are given below:

  • System Administrators
    • Maintain multi-project fluid-slurm-gcp resources for researchers
  • CFD Research Team
    • A research team using a fluid-slurm-gcp cluster for their work. They want to make sure their software ports and performs well on cloud resources. This team has one grant from DOE and one grant from NSF.
  • Biomechanics Research Team
    • Another research team using a fluid-slurm-gcp cluster for their work. They also want to make sure their software ports and performs well on cloud resources. This team has one grant from NSF and one grant from NIH.

For both research teams, compute expenses dedicated to work on a grant need to be segregated from other cloud expenses. Expenses associated with cloud resources that support multiple grants, such as login nodes and file servers, need to be isolated as well.


Solution

GSuite/Cloud Identity

For this pilot project, we will create three GSuite/Cloud Identity groups, as illustrated in Figure 1.

  1. System Admins
  2. CFD Research Team
  3. Biomechanics Research Team

User accounts are also created for each individual participating in the pilot project, and each individual is added as a member of their respective group.

Figure 1: An example configuration of Cloud Identity/GSuite groups and users.
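If you prefer to manage these groups as code, a sketch using the google provider’s google_cloud_identity_group resource is shown below. This assumes the Cloud Identity API is enabled and a sufficiently recent provider version; the customer ID and group email are placeholders.

```
# Create the CFD Research Team group in Cloud Identity. The customer ID and
# group email below are placeholders for this example organization.
resource "google_cloud_identity_group" "cfd_research_team" {
  display_name = "CFD Research Team"
  parent       = "customers/C0123abcd"

  group_key {
    id = "cfd-research-team@example-university.edu"
  }

  labels = {
    "cloudidentity.googleapis.com/groups.discussion_forum" = ""
  }
}
```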

Resource Collections

We create a folder called “slurm_gcp_folder” beneath Generic University’s organization node in GCP. Underneath this folder, we create resource collections for the system administrators, the CFD research team, and the Biomechanics research team. The organization hierarchy is shown in Figure 2.

Figure 2: An example project hierarchy created using the fluid_resource_collections module. In this example, there are three resource collections (“admin”, “CFD-research-team”, and “BIOMECH-research-team”). The resource collections are deployed underneath the parent folder “slurm_gcp_folder”, which is assumed to exist before the resource collections are created.

The administrators’ resource collection consists of a single project, slurm-gcp-host-project, which hosts the shared VPC network. We create a custom role called “Slurm GCP Admin” at the organization level. This role provides minimal permissions for SSH access through OS Login and allows the user to escalate privileges on Linux instances in GCP. Administrators are given the Slurm GCP Admin and Compute Admin roles at the admin folder level. We’ve also granted the Slurm GCP Admin role to the administrators group on each of the research teams’ folders.
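A plausible sketch of this role and its folder-level binding is shown below. The permission list, organization ID, folder ID, and group address are assumptions made for illustration, not the exact definitions used by Fluid Numerics.

```
# Custom "Slurm GCP Admin" role defined at the organization level, then bound
# to the system administrators group at the admin folder. The Compute Admin
# binding follows the same pattern with role = "roles/compute.admin".
resource "google_organization_iam_custom_role" "slurm_gcp_admin" {
  role_id     = "SlurmGCPAdmin"
  org_id      = "123456789012"
  title       = "Slurm GCP Admin"
  description = "SSH through OS Login with privilege escalation"
  permissions = [
    "compute.instances.get",
    "compute.instances.osLogin",
    "compute.instances.osAdminLogin",
    "compute.projects.get",
  ]
}

resource "google_folder_iam_member" "admins_slurm_gcp_admin" {
  folder = "folders/1000000000001" # admin folder (placeholder ID)
  role   = google_organization_iam_custom_role.slurm_gcp_admin.id
  member = "group:system-admins@example-university.edu"
}
```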

The CFD research team resource collection consists of the cfd-controller-project, cfd-nsf-project, and cfd-doe-project. We create a custom role called “Slurm GCP User” at the organization level, which provides minimal permissions to allow non-root SSH access through OS Login. The Slurm GCP User custom role is granted to the CFD research team group at the CFD-research-team folder level of the organization hierarchy.
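The user-facing role follows the same pattern; again, the permission list and identifiers below are illustrative assumptions.

```
# Custom "Slurm GCP User" role granting non-root OS Login access, bound to
# the CFD research team group at its folder. Identifiers are placeholders.
resource "google_organization_iam_custom_role" "slurm_gcp_user" {
  role_id     = "SlurmGCPUser"
  org_id      = "123456789012"
  title       = "Slurm GCP User"
  permissions = [
    "compute.instances.get",
    "compute.instances.osLogin",
    "compute.projects.get",
  ]
}

resource "google_folder_iam_member" "cfd_team_slurm_gcp_user" {
  folder = "folders/1000000000002" # CFD-research-team folder (placeholder ID)
  role   = google_organization_iam_custom_role.slurm_gcp_user.id
  member = "group:cfd-research-team@example-university.edu"
}
```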

Similarly, the Biomechanics research team resource collection consists of the biomech-controller-project, biomech-nih-project, and biomech-nsf-project, with the Slurm GCP User role granted to the Biomechanics team group at the BIOMECH-research-team folder level.

With this configuration of IAM policies, users in the CFD research team are prevented from accessing the Biomechanics team’s resources, and vice versa.


Network Resources

The slurm-gcp-host-project, in the admin resource collection, is marked as the shared VPC host project. The slurm-gcp-host-network is created within this project with subnetworks for each research team. Firewall rules are configured to allow users to SSH into VMs within their respective subnetworks.

Figure 3: Schematic of the admin resource collection with the shared VPC network and open network routes.
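A sketch of these network resources in Terraform is shown below. The region, CIDR range, and names are illustrative, and in practice the SSH firewall rule would be restricted to trusted source ranges.

```
# Shared VPC network in slurm-gcp-host-project, one subnetwork for the CFD
# team, and a firewall rule permitting SSH into the network. The Biomechanics
# subnetwork follows the same pattern. Values are illustrative.
resource "google_compute_network" "slurm_gcp_host_network" {
  name                    = "slurm-gcp-host-network"
  project                 = "slurm-gcp-host-project"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "cfd_subnet" {
  name          = "cfd-research-team-subnet"
  project       = "slurm-gcp-host-project"
  region        = "us-central1"
  network       = google_compute_network.slurm_gcp_host_network.self_link
  ip_cidr_range = "10.10.0.0/16"
}

resource "google_compute_firewall" "allow_ssh" {
  name          = "slurm-gcp-allow-ssh"
  project       = "slurm-gcp-host-project"
  network       = google_compute_network.slurm_gcp_host_network.self_link
  source_ranges = ["0.0.0.0/0"] # restrict to trusted ranges in practice

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}
```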

Multi-Project Fluid-Slurm-GCP Cluster

Within each research team’s resource collection, we have configured a fluid-slurm-gcp cluster with a login node and controller node deployed to the controller project. Each cluster has two compute partitions, each aligned with a GCP project and research grant. When users submit jobs to a partition, compute nodes are created within the appropriate GCP project.

Compute Engine VMs for each team are isolated within their respective subnetworks to prevent resource access between teams.

Figure 4: Schematic of the CFD-research-team resource collection and the multi-project fluid-slurm-gcp cluster deployed across the team’s GCP projects.


Orbitera Integration

All of the projects defined by the resource collections are given the same billing account. Compute Engine resources are separated into distinct GCP projects to allow for segregation of resource expenses. Further, since each team has projects aligned with individual research grants, we are able to generate billing breakdowns for each grant.

We’ve generated an example user-facing dashboard for the cfd-research group that gives a breakdown of costs by their controller and compute projects.

To build this configuration, the costs for the cfd-controller-project, cfd-nsf-project, and cfd-doe-project are aligned with the CFD Research Team customer in Orbitera. This allows the billing administrator to create a dashboard that gives total costs, a time-series plot of project costs, and plots depicting a breakdown of spend by product, cloud provider, and account number/GCP project (Figure 5).

Figure 5: This screenshot shows an example Orbitera dashboard for the cfd-research team. The dashboard provides a high-level view of total costs for a given time period in addition to time-series plots and breakdowns of costs by project for the cfd-research team.

The reports section of the Orbitera panel (Figure 6) allows customers to obtain detailed records of charges by SKU from the cloud provider.

Figure 6: This screenshot shows the detailed GCP Billing Report for the CFD research team. The report provides a breakdown by project and cloud SKU for a user-provided date range. Note that this billing report can be exported as a CSV.

Summary

This article provided an overview of the components that make up the multi-project fluid-slurm-gcp system. We’ve shown, through an example, how these components can be put together to manage multiple cloud-native HPC clusters while segregating billing charges for individual teams and research grants.

Fluid-slurm-gcp is a flexible cloud-HPC system that is capable of meeting the technical and business requirements of academic and government institutions.

If you’re interested in working with Fluid Numerics and fluid-slurm-gcp, reach out to sales@fluidnumerics.com. We can’t wait to help your organization meet its HPC goals in the cloud!