Setup vHPC
Overview
We use the setup of the vHPC as an example of what some aspects of working with the cloud look like.
Cloud provider
In order to access resources in the cloud, some kind of management software is needed. There are several options available: the big public cloud providers, or, as an example for on-premises (private) clouds, OpenStack.
Now we do not have such an infrastructure at our disposal, but we can still simulate one. We do this with a somewhat powerful server and libvirt; this will be our cloud provider.
Setup of the provider
As mentioned before, there is a management layer present and that needs to be installed. For on-premises clouds this includes installing the cloud software (e.g. OpenStack) and adding servers, storage, ... to the managed resources. In our case we install libvirt on Ubuntu.
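On Ubuntu this boils down to installing the libvirt/KVM packages; a minimal sketch (the exact package selection is an assumption and may differ between releases):

# install KVM, the libvirt daemon and client tools
sudo apt update
sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients virtinst

# allow the current user to manage VMs without root
sudo usermod -aG libvirt $USER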
Create your infrastructure
Now we need to define our infrastructure. In our case we would like to have:
one Slurm management node
several Slurm worker nodes
storage for each virtual machine
an operating system on each machine - we use Rocky Linux 8
initial setup of the nodes to allow further configuration (service account)
a network for all the machines
We describe all of this with Terraform, a declarative tool that is often used to define infrastructure. Luckily for us, there is a Terraform provider for libvirt.
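Wiring this provider into a project typically looks like the sketch below; the provider source and the connection URI shown are the common defaults, but treat them as assumptions for this particular setup:

# declare the community libvirt provider
terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
    }
  }
}

# connect to the libvirt daemon on the server
provider "libvirt" {
  uri = "qemu:///system"
}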
The actual code is written in HCL, Terraform's configuration language; below you see the final part of defining the worker nodes.
resource "libvirt_domain" "worker" {
count = var.count_worker_nodes
name = "${format("${var.os_prefix}-${var.w_short}-%02d", count.index + 1)}.${var.domain}"
memory = var.memory
vcpu = var.vcpu
cloudinit = element(libvirt_cloudinit_disk.commoninit_worker[*].id, count.index)
disk {
volume_id = element(libvirt_volume.worker[*].id, count.index)
}
network_interface {
network_name = var.network
wait_for_lease = true
}
console {
type = "pty"
target_port = "0"
target_type = "serial"
}
console {
type = "pty"
target_type = "virtio"
target_port = "1"
}
graphics {
type = "spice"
listen_type = "address"
autoport = true
}
connection {
type = "ssh"
private_key = file("~/.ssh/id_rsa")
user = var.user_name
timeout = "2m"
host = format("${var.os_prefix}-${var.w_short}-%02d", count.index + 1)
}
cpu {
mode = "host-passthrough"
}
}
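The sizing of the cluster is driven by Terraform variables such as var.count_worker_nodes, var.memory and var.vcpu. Their declarations could look roughly like this (the defaults shown are made up for illustration):

variable "count_worker_nodes" {
  description = "Number of Slurm worker VMs"
  type        = number
  default     = 4      # illustrative default
}

variable "memory" {
  description = "RAM per VM in MiB"
  type        = number
  default     = 8192   # illustrative default
}

variable "vcpu" {
  description = "Number of virtual CPUs per VM"
  type        = number
  default     = 4      # illustrative default
}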
With our Infrastructure as Code in place we can use Terraform to deploy it, and with that we have our machines. The minimal configuration to provide a service account for the next step is done via cloud-init.
This can easily be automated and is often called Infrastructure as a Service or IaaS.
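A minimal cloud-init user-data file for such a service account could look like the sketch below; the account name and public key are placeholders, not the values of the actual setup:

#cloud-config
users:
  - name: svc-ansible                      # hypothetical service account
    groups: wheel
    shell: /bin/bash
    sudo: "ALL=(ALL) NOPASSWD:ALL"
    ssh_authorized_keys:
      - ssh-rsa AAAA... admin@deploy-host  # public key of the machine running Ansible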
Configure the infrastructure
All we have now is a couple of virtual machines (VMs) running Linux and a service account that can be accessed via ssh. The next step is to actually configure them into a working HPC cluster. For this we use Ansible from Red Hat.
Again we write some yml files that do the actual configuration. More precisely, we used the excellent Ansible playbooks provided by the elasticluster project. Unfortunately, this project has become a bit stale, but with some adaptations it works for our setup.
The main idea is to define playbooks. A playbook usually configures a certain service, in our case the vHPC. The service consists of different roles, so we subdivide the configuration into roles. For example, the Slurm playbook is split into three parts:
slurm-common, needed on both node types
slurm-master, just for the management node
slurm-worker, just for the worker nodes
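To give an idea of what such a role contains, a hypothetical excerpt of the slurm-common tasks might look like this (module usage is standard Ansible, but package and file names are assumptions, not the actual elasticluster code):

# roles/slurm-common/tasks/main.yml (hypothetical excerpt)
- name: Install Slurm and munge packages
  package:
    name:
      - slurm        # package name depends on the repository used
      - munge
    state: present

- name: Distribute the shared slurm.conf
  template:
    src: slurm.conf.j2
    dest: /etc/slurm/slurm.conf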
We can group the different VMs per role and run the playbook for each role. The entire elasticluster setup is much bigger. To allow reuse, it further subdivides into more specific services and tasks. It is also possible to provide different implementations for different platforms, e.g. RHEL- or Debian-based Linux distributions.
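The grouping of VMs per role is expressed in the Ansible inventory. A minimal sketch of such an inventory (host names follow the naming scheme from the Terraform code but are purely illustrative) could look like:

all:
  children:
    slurm_master:
      hosts:
        rocky8-mgmt-01.example.org:      # hypothetical management node
    slurm_worker:
      hosts:
        rocky8-worker-01.example.org:    # hypothetical worker nodes
        rocky8-worker-02.example.org:
  vars:
    ansible_user: svc-ansible            # the cloud-init service account (placeholder name)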
This is an excerpt of the slurm.yml playbook [1]:
---
- name: Slurm master Playbook
  hosts: slurm_master
  roles:
    - role: 'nis'
      NIS_MASTER: "{{groups.slurm_master[0]}}"
      when: 'multiuser_cluster|default("true")|bool'
    - role: 'nfs-server'
      NFS_EXPORTS:
        - path: '/home'
          clients: "{{groups.slurm_worker + groups.slurm_submit|default([])}}"
    - slurm-master

- name: Slurm worker nodes Playbook
  hosts: slurm_worker
  roles:
    - role: 'nis'
      NIS_MASTER: "{{groups.slurm_master[0]}}"
      when: 'multiuser_cluster|default("true")|bool'
    - role: 'nfs-client'
      NFS_MOUNTS:
        - fs: '{{groups.slurm_master[0]}}:/home'
          mountpoint: '/home'
    - slurm-worker
Ansible itself runs on the client and connects to the nodes via the service account and ssh. The individual tasks are executed on the nodes and, once they are finished, Ansible removes them again, leaving behind a configured system (more details in [2]).
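Before running the full playbook it is worth checking that Ansible can actually reach all nodes; a simple connectivity test, reusing the inventory and key file names that also appear in the playbook command further below, is:

ansible all -i inventory.yml -m ping --key-file id_rsa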
This part is often summarized under the umbrella of Platform as a Service (PaaS).
Use the infrastructure
Now that we have deployed and configured our service, we can use it. Once all pieces work together, the virtual HPC can be brought to life with the following sequence of commands:
terraform plan -out planfile
terraform apply -auto-approve planfile
ansible-playbook --inventory inventory.yml --become --key-file id_rsa main.yml
and destroyed with
terraform destroy -auto-approve
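In between bringing the cluster up and tearing it down again, it behaves like any other Slurm cluster. For example, a quick sanity check from the management node could be (standard Slurm commands, not specific to this setup):

sinfo                  # list partitions and the state of the worker nodes
srun -N 2 hostname     # run a trivial job on two worker nodes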
[1] Original source on github
[2] How does Ansible work?