Creating a SLURM Cluster for Scheduling NVIDIA MIG-Based GPU-Accelerated Workloads
Today, researchers and developers often use a dedicated GPU for their workloads, even when only a fraction of the GPU’s compute power is needed. The NVIDIA A100, A30, and H100 Tensor Core GPUs support a feature called Multi-Instance GPU (MIG). MIG partitions a single GPU into as many as seven instances (up to four on the A30), each with its own dedicated compute, memory, and bandwidth. This enables multiple users to run their workloads on the same GPU, maximizing per-GPU utilization and boosting user productivity.
In this blog, we will guide you through the process of creating a SLURM cluster and integrating NVIDIA’s Multi-Instance GPU (MIG) feature to efficiently schedule GPU-accelerated jobs. We will cover the installation and configuration of SLURM, as well as the setup of MIG on NVIDIA GPUs.
Overview:
SLURM (Simple Linux Utility for Resource Management) is an open-source job scheduler used by many of the world’s supercomputers and HPC (High-Performance Computing) clusters. It facilitates the allocation of resources such as CPUs, memory, and GPUs to users and their jobs, ensuring efficient use of available hardware. SLURM provides robust workload management capabilities, including job queuing, prioritization, scheduling, and monitoring.
MIG (Multi-Instance GPU) is a feature introduced by NVIDIA with its Ampere and Hopper data center GPUs (such as the A100, A30, and H100 Tensor Core GPUs), allowing a single physical GPU to be partitioned into multiple independent GPU instances. Each MIG instance operates with dedicated memory, cache, and compute cores, enabling multiple users or applications to share a single GPU securely and efficiently. This capability enhances resource utilization and provides a level of flexibility and isolation not previously possible with traditional GPUs.
Advantages of Using NVIDIA MIG (Multi-Instance GPU):
Improved Resource Utilization
Maximizes GPU Usage: MIG allows you to run multiple smaller workloads on a single GPU, ensuring that the GPU’s resources are fully utilized. This is especially useful for applications that do not need the full capacity of a GPU.
Cost Efficiency: By enabling multiple instances on a single GPU, organizations can achieve better cost-efficiency, reducing the need to purchase additional GPUs.
Workload Isolation
– Security and Stability: Each GPU instance is fully isolated, ensuring that workloads do not interfere with each other. This is critical for multi-tenant environments where different users or applications might run on the same physical hardware.
– Predictable Performance: Isolation ensures consistent and predictable performance for each instance, avoiding resource contention issues.
Scalability and Flexibility
– Adaptability: MIG allows dynamic partitioning of GPU resources, making it easy to scale workloads up or down based on demand. You can allocate just the right amount of resources needed for different tasks.
– Multi-Tenant Support: Ideal for cloud service providers and data centers that host services for multiple customers, each requiring different levels of GPU resources.
Simplified Management
– Administrative Control: Administrators can use NVIDIA tools to easily configure, manage, and monitor the GPU instances. This includes allocating specific memory and compute resources to each instance.
– Automated Management: Tools and software can automate the allocation and management of GPU resources, reducing the administrative overhead.
Enhanced Performance for Diverse Workloads
– Support for Various Applications: MIG supports a wide range of applications, from AI inference and training to data analytics and virtual desktops. This makes it versatile for different types of computational workloads.
– Optimized Performance: By running multiple instances optimized for specific tasks, you can achieve better overall performance compared to running all tasks on a single monolithic GPU.
Better Utilization in Shared Environments
– Educational and Research Institutions: In environments where GPUs are shared among students or researchers, MIG allows multiple users to access GPU resources simultaneously without impacting each other’s work.
– Development and Testing: Developers can use MIG to test and develop applications in an environment that simulates multi-GPU setups without requiring multiple physical GPUs.
By leveraging the power of NVIDIA’s MIG feature within a SLURM-managed cluster, you can significantly enhance the efficiency and productivity of your GPU-accelerated workloads. Join us as we delve into the steps for setting up this powerful combination and unlock the full potential of your computational resources.
Prerequisites
Scheduler:
Size: Standard D4s v5 (4 vCPUs, 16 GiB memory)
Image: Ubuntu-HPC 2204 – Gen2 (Ubuntu 22.04)
Scheduling software: Slurm 23.02.7-1
Execute VM:
Size: Standard NC40ads H100 v5 (40 vCPUs, 320 GiB memory)
Image: Ubuntu-HPC 2204 – Gen2 (Ubuntu 22.04) – Image contains Nvidia GPU driver.
It is recommended to install the latest NVIDIA GPU driver. The minimum versions are provided below:
If using H100, then CUDA 12 and NVIDIA driver R525 ( >= 525.53) or later
If using A100/A30, then CUDA 11 and NVIDIA driver R450 ( >= 450.80.02) or later
Scheduling software: Slurm 23.02.7-1
Slurm Scheduler setup:
Step 1: Create dedicated users for the Munge and SLURM services so that both services run under their own unprivileged accounts.
groupadd -g 11101 munge
useradd -u 11101 -g 11101 -s /bin/false -M munge
groupadd -g 11100 slurm
useradd -u 11100 -g 11100 -s /bin/false -M slurm
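The fixed UID/GID values matter: the same IDs will be reused on the execute VM later so that Munge and Slurm identities match across the cluster. A quick optional check that the accounts were created as expected:
id munge
id slurm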
Step 2: Setup NFS Server on Scheduler
NFS will be used to share configuration files across the cluster.
apt install nfs-kernel-server -y
mkdir -p /sched /shared/home
echo "/sched *(rw,sync,no_root_squash)" >> /etc/exports
echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports
systemctl restart nfs-server
systemctl enable nfs-server.service
showmount -e
Step 3: Install and Configure Munge
Munge is used for authentication across the SLURM cluster.
apt install -y munge
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
cp /etc/munge/munge.key /sched/
chown munge:munge /sched/munge.key
chmod 400 /sched/munge.key
systemctl restart munge
systemctl enable munge
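As an optional sanity check, you can confirm that Munge can encode and decode a credential locally before bringing Slurm up:
munge -n | unmunge
A healthy installation reports STATUS: Success (0).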
Step 4: Install and Configure SLURM on Scheduler
Install the Slurm controller daemon (slurmctld) and set up the directories for Slurm.
apt install slurm-slurmctld -y
mkdir -p /etc/slurm /var/spool/slurmctld /var/log/slurmctld
chown slurm:slurm /etc/slurm /var/spool/slurmctld /var/log/slurmctld
Creating the `slurm.conf` file. Alternatively, you can generate the file using the Slurm configurator tool.
cat <<EOF > /sched/slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
GresTypes=gpu
ClusterName=mycluster
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
SchedulerParameters=max_switch_wait=24:00:00
Include accounting.conf
Include partitions.conf
EOF
echo "SlurmctldHost=$(hostname -s)" >> /sched/slurm.conf
Creating cgroup.conf for Slurm:
This command creates a configuration file named cgroup.conf in the /sched directory with specific settings for Slurm’s cgroup resource management.
cat <<EOF > /sched/cgroup.conf
CgroupAutomount=no
ConstrainCores=yes
ConstrainRamSpace=yes
ConstrainDevices=yes
EOF
Configuring Accounting Storage Type for Slurm:
echo "AccountingStorageType=accounting_storage/none" >> /sched/accounting.conf
Changing Ownership of Configuration Files:
chown slurm:slurm /sched/*.conf
Creating Symbolic Links for Configuration Files:
ln -s /sched/slurm.conf /etc/slurm/slurm.conf
ln -s /sched/cgroup.conf /etc/slurm/cgroup.conf
ln -s /sched/accounting.conf /etc/slurm/accounting.conf
Configure the Execute VM
1. Check and enable the NVIDIA GPU driver and MIG mode. More details on MIG can be found in the NVIDIA MIG documentation.
Ensure the GPU driver is installed; the Ubuntu-HPC 2204 image already includes the NVIDIA GPU driver. If the driver is missing, install it first. The following commands enable persistence mode and NVIDIA MIG mode:
root@h100vm:~# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000001:00:00.0.
All done.
root@h100vm:~# nvidia-smi -mig 1
Enabled MIG Mode for GPU 00000001:00:00.0
All done.
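Optionally, confirm that MIG mode is now active before creating any instances (assuming a single GPU at index 0; the query fields below are standard nvidia-smi fields):
nvidia-smi --query-gpu=index,mig.mode.current --format=csv
The current MIG mode should be reported as Enabled. Note that on some systems a GPU reset or reboot may be required before the pending MIG mode takes effect.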
2. Check supported profiles and create MIG partitions.
The following command lists the MIG profiles supported by the NVIDIA H100 GPU.
root@h100vm:~# nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.12gb+me    20     1/1        10.75      No     16     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.24gb       15     4/4        21.62      No     26     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.24gb       14     3/3        21.62      No     32     2     0   |
|                                                             2     2     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |
|                                                             3     3     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.47gb        5     1/1        46.38      No     64     4     0   |
|                                                             4     4     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.94gb        0     1/1        93.12      No     132    7     0   |
|                                                             8     7     1   |
+-----------------------------------------------------------------------------+
Create the MIG partitions using the following command. In this example, we create 4 MIG partitions using the 1g.24gb profile (profile ID 15, hence -cgi 15,15,15,15; the -C flag also creates the corresponding compute instances).
root@h100vm:~# nvidia-smi mig -cgi 15,15,15,15 -C
Successfully created GPU instance ID 6 on GPU 0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 6 using profile MIG 1g.24gb (ID 7)
Successfully created GPU instance ID 5 on GPU 0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 5 using profile MIG 1g.24gb (ID 7)
Successfully created GPU instance ID 3 on GPU 0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 3 using profile MIG 1g.24gb (ID 7)
Successfully created GPU instance ID 4 on GPU 0 using profile MIG 1g.24gb (ID 15)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 4 using profile MIG 1g.24gb (ID 7)
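Before moving on, you can optionally list the GPU instances and the resulting MIG device UUIDs (the IDs and UUIDs will differ on your system):
root@h100vm:~# nvidia-smi mig -lgi
root@h100vm:~# nvidia-smi -L
Running the full nvidia-smi command should now show the four MIG devices: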
root@h100vm:~# nvidia-smi
Fri Jul 5 06:32:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000001:00:00.0 Off |                   On |
| N/A   38C    P0              61W / 400W |     51MiB / 95830MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                           |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
3. Create the Munge and SLURM users on the execute VM (use the same UIDs/GIDs as on the scheduler).
groupadd -g 11101 munge
useradd -u 11101 -g 11101 -s /bin/false -M munge
groupadd -g 11100 slurm
useradd -u 11100 -g 11100 -s /bin/false -M slurm
4. Mount the NFS shares from the scheduler (use the scheduler’s IP address).
mkdir /shared /sched
mount <scheduler ip>:/sched /sched
mount <scheduler ip>:/shared /shared
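To make these mounts persistent across reboots, you can optionally add them to /etc/fstab (a minimal sketch; <scheduler ip> is the same placeholder used above):
# /etc/fstab entries for the Slurm configuration and shared home directories
<scheduler ip>:/sched   /sched   nfs  defaults,_netdev  0 0
<scheduler ip>:/shared  /shared  nfs  defaults,_netdev  0 0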
5. Install and Configure Munge
apt install munge -y
cp /sched/munge.key /etc/munge/
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl restart munge.service
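Optionally, verify that a credential generated on the execute VM decodes on the scheduler (this assumes SSH access from the execute VM to the scheduler; replace <scheduler ip> accordingly):
munge -n | ssh <scheduler ip> unmunge
If the shared munge.key is correct, the scheduler reports STATUS: Success (0).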
6. Install and Configure SLURM on the execute VM
apt install slurm-slurmd -y
mkdir -p /etc/slurm /var/spool/slurmd /var/log/slurmd
chown slurm:slurm /etc/slurm /var/spool/slurmd /var/log/slurmd
ln -s /sched/slurm.conf /etc/slurm/slurm.conf
ln -s /sched/cgroup.conf /etc/slurm/cgroup.conf
ln -s /sched/accounting.conf /etc/slurm/accounting.conf
7. Create the GRES configuration for MIG. The following steps use NVIDIA’s slurm-mig-discovery tool, with a single H100 system as an example.
git clone https://gitlab.com/nvidia/hpc/slurm-mig-discovery.git
cd slurm-mig-discovery
gcc -g -o mig -I/usr/local/cuda/include -I/usr/cuda/include mig.c -lnvidia-ml
./mig
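The program writes gres.conf and cgroup_allowed_devices_file.conf into the current directory (these are the files copied in a later step); a quick listing confirms they were generated:
ls -l gres.conf cgroup_allowed_devices_file.conf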
8. Check the generated GRES configuration file.
root@h100vm:~/slurm-mig-discovery# cat gres.conf
# GPU 0 MIG 0 /proc/driver/nvidia/capabilities/gpu0/mig/gi3/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap30
# GPU 0 MIG 1 /proc/driver/nvidia/capabilities/gpu0/mig/gi4/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap39
# GPU 0 MIG 2 /proc/driver/nvidia/capabilities/gpu0/mig/gi5/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap48
# GPU 0 MIG 3 /proc/driver/nvidia/capabilities/gpu0/mig/gi6/access
Name=gpu Type=1g.22gb File=/dev/nvidia-caps/nvidia-cap57
9. Copy the generated configuration files to the central location.
cp gres.conf cgroup_allowed_devices_file.conf /sched/
chown slurm:slurm /sched/cgroup_allowed_devices_file.conf
chown slurm:slurm /sched/gres.conf
10. Create symlinks in the Slurm configuration directory.
ln -s /sched/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf
ln -s /sched/gres.conf /etc/slurm/gres.conf
11. Create the Slurm partitions file. This command creates a configuration file named `partitions.conf` in the `/sched` directory. It defines:
– A partition named `gpu` containing node `h100vm`, set as the default partition with no time limit.
– The node `h100vm` has 40 CPUs, 1 board, 1 socket per board, 40 cores per socket, and 1 thread per core.
– It has 322243 MB of real memory.
– GPU resources (GRES) are declared as 4 MIG devices of type `1g.22gb` (`gpu:1g.22gb:4`).
cat << 'EOF' > /sched/partitions.conf
PartitionName=gpu Nodes=h100vm Default=YES MaxTime=INFINITE State=UP
NodeName=h100vm CPUs=40 Boards=1 SocketsPerBoard=1 CoresPerSocket=40 ThreadsPerCore=1 RealMemory=322243 Gres=gpu:1g.22gb:4
EOF
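The CPU, socket, and memory values in this file should match what the node actually reports. One optional way to cross-check them on the execute VM (once slurm-slurmd is installed) is:
slurmd -C
which prints a NodeName line with the detected CPUs, boards, sockets, cores, threads, and RealMemory that you can compare against partitions.conf.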
12. Set the permissions on `partitions.conf` and create a symlink in the Slurm configuration directory.
chown slurm:slurm /sched/partitions.conf
ln -s /sched/partitions.conf /etc/slurm/partitions.conf
Finalize and Start the SLURM Services
On Scheduler:
ln -s /sched/partitions.conf /etc/slurm/partitions.conf
ln -s /sched/cgroup_allowed_devices_file.conf /etc/slurm/cgroup_allowed_devices_file.conf
ln -s /sched/gres.conf /etc/slurm/gres.conf
systemctl restart slurmctld
systemctl enable slurmctld
On Execute VM
systemctl restart slurmd
systemctl enable slurmd
Run the `sinfo` command on the scheduler VM to verify the Slurm configuration.
root@scheduler:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu* up infinite 1 idle h100vm
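You can also confirm that the MIG devices were registered as GRES on the node (the grep is only to shorten the output):
scontrol show node h100vm | grep -i gres
The output should include Gres=gpu:1g.22gb:4, matching the four MIG partitions.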
Testing job submission and MIG functionality
1. To submit the job, first create a test user. In this example, we’ll create a test user named `vinil` for testing purposes. Start by creating the user on the scheduler and then on the execute VM. We have set up an NFS server to share the `/shared` directory, which will serve as the centralized home directory for the user.
# On Scheduler VM
useradd -m -d /shared/home/vinil -u 20001 vinil
# Execute VM
useradd -d /shared/home/vinil -u 20001 vinil
On Scheduler VM:
2. I am using a CIFAR-10 training job to test the 4 MIG instances we created. I will set up an Anaconda environment, install the TensorFlow GPU libraries, and run 4 jobs simultaneously on a single node through Slurm to demonstrate GPU workload scheduling on MIG partitions.
# Download and install Anaconda software.
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
chmod +x Anaconda3-2024.06-1-Linux-x86_64.sh
sh Anaconda3-2024.06-1-Linux-x86_64.sh -b
3. Create a Conda environment named `mlprog` and install the TensorFlow GPU libraries.
# Set the PATH and create a Conda environment called mlprog.
export PATH=$PATH:/shared/home/vinil/anaconda3/bin
/shared/home/vinil/anaconda3/bin/conda init
source ~/.bashrc
/shared/home/vinil/anaconda3/bin/conda create -n mlprog tensorflow-gpu -y
4. The following code will download the `cifar10.py` script, which contains the CIFAR-10 image classification machine learning code written using TensorFlow.
#Download the CIFAR10 code.
wget https://raw.githubusercontent.com/vinil-v/slurm-mig-setup/main/test_job_setup/cifar10.py
5. Create a job submission script named `mljob.sh` to run the job on a GPU using the Slurm scheduler. This script is designed to submit a job named `MLjob` to the GPU partition (`--partition=gpu`) of the Slurm scheduler. It allocates 10 tasks (`--ntasks=10`) and specifies GPU resources (`--gres=gpu:1g.22gb:1`). The script sets up the environment by adding Conda to the PATH and activating the `mlprog` Conda environment before executing the `cifar10.py` script to perform CIFAR-10 image classification using TensorFlow.
#!/bin/sh
#SBATCH --job-name=MLjob
#SBATCH --partition=gpu
#SBATCH --ntasks=10
#SBATCH --gres=gpu:1g.22gb:1
export PATH=$PATH:/shared/home/vinil/anaconda3/bin
source /shared/home/vinil/anaconda3/bin/activate mlprog
python cifar10.py
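Before submitting the batch jobs, an optional interactive check can confirm that a job is confined to a single MIG slice (a one-off test, not part of the workflow above):
srun --partition=gpu --gres=gpu:1g.22gb:1 nvidia-smi -L
Because cgroup.conf sets ConstrainDevices=yes, the job should only be able to see the MIG device it was allocated.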
6. Submit the job using the `sbatch` command and execute 4 instances of the job using the same `mljob.sh` script. This method will fully utilize all 4 MIG partitions available on the node. After submission, use the `squeue` command to check the status. You will observe all 4 jobs in the Running state.
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 7
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 8
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 9
(mlprog) vinil@scheduler:~$ sbatch mljob.sh
Submitted batch job 10
(mlprog) vinil@scheduler:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7 gpu MLjob vinil R 0:05 1 h100vm
8 gpu MLjob vinil R 0:01 1 h100vm
9 gpu MLjob vinil R 0:01 1 h100vm
10 gpu MLjob vinil R 0:01 1 h100vm
7. Log in to the execute VM and run the `nvidia-smi` command. You will observe that all 4 MIG GPU partitions are allocated to the jobs and are currently running.
azureuser@h100vm:~$ nvidia-smi
Fri Jul 5 07:32:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000001:00:00.0 Off |                   On |
| N/A   43C    P0              90W / 400W |  83393MiB / 95830MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                           |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |           20846MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |           20846MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |           20850MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |           20850MiB / 22144MiB  | 26      0 |  1   0    1    0    1 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    3    0      11813      C   python                                    20826MiB |
|    0    4    0      11836      C   python                                    20826MiB |
|    0    5    0      11838      C   python                                    20830MiB |
|    0    6    0      11834      C   python                                    20830MiB |
+---------------------------------------------------------------------------------------+
azureuser@h100vm:~$
Conclusion:
You have now successfully set up a SLURM cluster with NVIDIA MIG integration. This setup allows you to efficiently schedule and manage GPU jobs, ensuring optimal utilization of resources. With SLURM and MIG, you can achieve high performance and scalability for your computational tasks. Happy computing!