Fine-tuning a Hugging Face Diffusion Model on CycleCloud Workspace for Slurm
Introduction
Azure CycleCloud Workspace for Slurm (CCWS) is a new Infrastructure as a Service (IaaS) solution that lets users purpose-build an environment for their distributed AI training and HPC simulation needs. The offering is available through the Azure Marketplace and simplifies and streamlines the creation and management of Slurm clusters on Azure. Users can create and configure pre-defined Slurm clusters with Azure CycleCloud without any prior knowledge of the cloud or Slurm, while retaining full control of the infrastructure. The clusters come pre-configured with PMIx v4, Pyxis and Enroot to support containerized AI Slurm jobs, and users can bring their own libraries or customize the cluster with the software of their choice. Users can access the provisioned login node (or scheduler node) over SSH or Visual Studio Code to perform common tasks such as submitting and managing Slurm jobs. This blog outlines the steps to fine-tune a diffusion model from Hugging Face using CCWS.
Deploy Azure CycleCloud Workspace for Slurm
To deploy Azure CCWS from the Azure Marketplace, follow the steps mentioned here.
Our provisioned CCWS environment comprises the following:
(1) One login node of the F4sv2 SKU and one scheduler node of the D4asv4 SKU, running Slurm 23.11.7.
(2) Up to 8 on-demand GPU nodes of the NDasrA100v4 SKU in the gpu partition.
(3) Up to 16 on-demand HPC nodes of the HBv3 SKU in the hpc partition.
(4) Up to 10 on-demand HTC nodes of the NCA100v4 SKU in the htc partition.
(5) 1 TB of shared NFS storage.
Note: For items (1) to (4) above, it is important to request quota from the Azure portal before CCWS is deployed.
Once CCWS is deployed, SSH into the scheduler node. We can check the mounted storage using:
$ df -h
This cluster has 1 TB of shared NFS available on all the nodes.
Prepare the environment
Once CCWS is deployed, SSH into the login node.
(1) Install the conda environment and the required tools on the shared NFS. In our environment, the shared NFS is mounted at /shared.
First, download the Miniconda installer.
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Install Miniconda on /shared:
$ bash Miniconda3-latest-Linux-x86_64.sh
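If you prefer a non-interactive install, the installer can be pointed directly at the shared NFS; the prefix /shared/miniconda3 below is an assumption, adjust it to your mount point. Afterwards, make conda available in the current shell:
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p /shared/miniconda3
$ source /shared/miniconda3/bin/activate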
(2) Once conda is installed, install Python 3.8. Note that this code may run into library conflicts if a newer version of Python is chosen.
$ conda install python=3.8
(3) Clone the Hugging Face diffusers repository and install the required libraries.
$ git clone https://github.com/huggingface/diffusers
$ cd diffusers
$ pip install .
(4) Move to the examples folder and install the packages listed in the requirements.txt file, as shown below.
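As a sketch, assuming the text-to-image example is the one being fine-tuned (the folder name will differ for other examples):
$ cd examples/text_to_image
$ pip install -r requirements.txt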
Prepare the Slurm Submission Script
Below is the Slurm submission script used to initiate a multi-GPU training job. The provisioned GPU node is an NDasrA100v4 SKU, which has 8 GPUs.
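A minimal sketch of such a script is shown below, assuming the diffusers text-to-image example, a conda install under /shared/miniconda3 and an Accelerate multi-GPU launch; the base model, dataset and output paths are placeholders to be replaced with your own.

#!/bin/bash
#SBATCH --job-name=sd-finetune
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --output=sd-finetune-%j.log

# Activate the conda environment installed on the shared NFS
source /shared/miniconda3/bin/activate

# Placeholders: replace with the base model and dataset used for your fine-tuning run
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="<your-dataset>"

cd /shared/diffusers/examples/text_to_image

# Launch the training script on all 8 GPUs of the NDasrA100v4 node
accelerate launch --multi_gpu --num_processes 8 train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=1000 \
  --output_dir=/shared/sd-finetuned-model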
The job can be submitted using sbatch and queried using squeue. Azure CCWS takes a few minutes (6-7) to provision a GPU node once the job is submitted, after which the job starts running.
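For example, assuming the script above is saved as submit.sh on the shared NFS:
$ sbatch submit.sh
$ squeue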
Querying sinfo lists the nodes allocated to our job as well as the nodes still available. It is important to note that users are charged only for nodes in the 'alloc' state, not for those in 'idle'.
The NODELIST column provides the node names, and the user can SSH into a node from the login node using its name:
$ ssh ccw-gpu-1
$ nvidia-smi
Inferencing on the model
Once fine-tuning is complete, the model can be tested for inference. We test this using a simple Python script, but there are several other ways to run inference.
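A minimal sketch of such a script is given below, assuming the fine-tuned model was written to /shared/sd-finetuned-model (the output path used in the submission script above):

# inference.py - generate an image with the fine-tuned model
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned weights from the shared NFS (path is an assumption)
pipe = StableDiffusionPipeline.from_pretrained(
    "/shared/sd-finetuned-model", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate and save an image for the prompt "yoda"
prompt = "yoda"
image = pipe(prompt).images[0]
image.save("yoda.png")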
The above script generates an image of the Yoda character from our prompt input “yoda” using the fine-tuned model.
Conclusion
This blog has outlined the steps to fine-tune a diffusion model from Hugging Face using Azure CycleCloud Workspace for Slurm. The environment is secure, fully customizable, and allows users to bring their own code and software packages, select their choice of virtual machines and high-performance storage (such as Azure Managed Lustre), and deploy into their own virtual network. The operating system images (Ubuntu 20.04 in this case) can be fetched from the Azure Marketplace and come preconfigured with InfiniBand drivers, MPI libraries, NCCL communicators and the CUDA toolkit, making the cluster ready to use within minutes. Importantly, the GPU VMs are on-demand (auto-scaled), so the user is charged only for the duration of usage.