Fine-tuning a Hugging Face Diffusion Model on CycleCloud Workspace for Slurm
Introduction
Azure CycleCloud Workspace for Slurm (CCWS) is a new Infrastructure as a Service (IaaS) solution that lets users purpose-build an environment for their distributed AI training and HPC simulation needs. The offering is available through the Azure Marketplace and simplifies and streamlines the creation and management of Slurm clusters on Azure. Users can create and configure pre-defined Slurm clusters with Azure CycleCloud without any prior knowledge of the cloud or Slurm, while retaining full control of the infrastructure. The clusters come pre-configured with PMIx v4, Pyxis and Enroot to support containerized AI Slurm jobs, and users can bring their own libraries or customize the cluster with the software of their choice. Users can access the provisioned login node (or scheduler node) over SSH or Visual Studio Code to perform common tasks such as submitting and managing Slurm jobs. This blog outlines the steps to fine-tune a diffusion model from Hugging Face using CCWS.
Deploy Azure CycleCloud Workspace for Slurm
To deploy Azure CCWS from the Azure Marketplace, follow the steps mentioned here.
Our provisioned CCWS environment comprises the following:
(1) One login node of the F4sv2 SKU and one scheduler node of the D4asv4 SKU, running Slurm 23.11.7.
(2) Up to 8 on-demand GPU nodes of the NDasrA100v4 SKU in the gpu partition.
(3) Up to 16 on-demand HPC nodes of the HBv3 SKU in the hpc partition.
(4) Up to 10 on-demand HTC nodes of the NCA100v4 SKU in the htc partition.
(5) 1 TB of shared NFS storage.
Note: For items (1) to (4) above, it is important to request quota from the Azure portal before CCWS is deployed.
Once CCWS is deployed, SSH into the scheduler node. We can check the mounted storage using:
$ df -h
This cluster has 1 TB of shared NFS available on all the nodes.
Prepare the environment
Once CCWS is deployed, SSH into the login node.
(1) Install the conda environment and the required tools on the shared NFS. In our environment, the shared NFS is mounted at /shared.
First, download the Miniconda installer.
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Install Miniconda on /shared:
$ bash Miniconda3-latest-Linux-x86_64.sh
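If you prefer a non-interactive install, the installer can be pointed directly at the shared NFS; the prefix /shared/miniconda3 below is an assumption, adjust it to your mount point. Afterwards, make conda available in the current shell:
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p /shared/miniconda3
$ source /shared/miniconda3/bin/activate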
(2) Once conda is installed, install Python 3.8. Note that this code may run into library conflicts if a newer version of Python is chosen.
$ conda install python=3.8
(3) Clone the Hugging Face diffusers repository and install the required libraries.
$ git clone https://github.com/huggingface/diffusers
$ cd diffusers
$ pip install .
(4) Move to the examples folder and install the packages listed in the requirements.txt file, as shown below.
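As a sketch, assuming the text-to-image example is the one being fine-tuned (the folder name will differ for other examples):
$ cd examples/text_to_image
$ pip install -r requirements.txt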
Prepare the Slurm Submission Script
Below is the Slurm submission script used to initiate a multi-GPU training job. The provisioned GPU node is an NDasrA100v4 SKU, which has 8 GPUs.
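A minimal sketch of such a script is shown below, assuming the diffusers text-to-image example, a conda install under /shared/miniconda3 and an Accelerate multi-GPU launch; the base model, dataset and output paths are placeholders to be replaced with your own.

#!/bin/bash
#SBATCH --job-name=sd-finetune
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --output=sd-finetune-%j.log

# Activate the conda environment installed on the shared NFS
source /shared/miniconda3/bin/activate

# Placeholders: replace with the base model and dataset used for your fine-tuning run
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="<your-dataset>"

cd /shared/diffusers/examples/text_to_image

# Launch the training script on all 8 GPUs of the NDasrA100v4 node
accelerate launch --multi_gpu --num_processes 8 train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=1000 \
  --output_dir=/shared/sd-finetuned-model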
The job can be submitted using sbatch and queried using squeue. Azure CCWS takes a few minutes (6-7) to provision a GPU node once the job is submitted, after which the job starts running.
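For example, assuming the script above is saved as submit.sh on the shared NFS:
$ sbatch submit.sh
$ squeue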
Querying sinfo lists the nodes allocated to our job as well as the nodes still available. It is important to note that users are charged only for nodes in the 'alloc' state, not for those in 'idle'.
The NODELIST column provides the node names, and the user can SSH into a node from the login node using its name:
$ ssh ccw-gpu-1
$ nvidia-smi
Inferencing on the model
Once fine-tuning is complete, the model can be tested for inference. We test this using a simple Python script, but there are several other ways to run inference.
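A minimal sketch of such a script is given below, assuming the fine-tuned model was written to /shared/sd-finetuned-model (the output path used in the submission script above):

# inference.py - generate an image with the fine-tuned model
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned weights from the shared NFS (path is an assumption)
pipe = StableDiffusionPipeline.from_pretrained(
    "/shared/sd-finetuned-model", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate and save an image for the prompt "yoda"
prompt = "yoda"
image = pipe(prompt).images[0]
image.save("yoda.png")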
The above script generates an image of the Yoda character from our prompt input “yoda” using the fine-tuned model.
Conclusion
This blog has outlined the steps to fine-tune a diffusion model from Hugging Face using Azure CycleCloud Workspace for Slurm. The environment is secure, fully customizable, and allows users to bring their own code and software packages, select their choice of virtual machines and high-performance storage (such as Azure Managed Lustre), and deploy into their own virtual network. The operating system images (Ubuntu 20.04 in this case) can be fetched from the Azure Marketplace and come preconfigured with InfiniBand drivers, MPI libraries, NCCL communicators and the CUDA toolkit, making the cluster ready to use within minutes. Importantly, the GPU VMs are on-demand (auto-scaled), so the user is charged only for the duration of usage.