Fine-tune/Evaluate/Quantize SLM/LLM using torchtune on Azure ML
In this blog, we’ll explore how to leverage torchtune on Azure ML to fine-tune, evaluate, and quantize small and large language models (SLM/LLM) effectively.
As demand for adaptable and efficient language models grows, there’s a need for robust tools that make model fine-tuning and optimization more accessible. torchtune is a versatile library that simplifies these processes, offering support for distributed training, flexible logging, and model quantization. Azure ML complements torchtune by providing scalable infrastructure and integration options, making it an ideal platform for experimenting with and deploying SLM/LLMs.
This guide provides hands-on code examples and step-by-step instructions for:
- Setting up Azure ML to work with torchtune for distributed model fine-tuning.
- Handling dynamic path adjustments in the YAML recipe, particularly useful for Azure’s storage-mounted environments.
- Applying quantization techniques to optimize models for deployment on resource-limited devices.
By the end of this guide, you’ll be equipped to run scalable and efficient language model pipelines using torchtune on Azure ML, enhancing your model’s performance and accessibility.
Hands-on Labs: https://github.com/Azure/torchtune-azureml
1. Introduction
1.1. torchtune
torchtune is a Python library designed to simplify fine-tuning SLMs/LLMs with PyTorch. It stands out for its simplicity and flexibility: users can fine-tune, evaluate, and quantize models with minimal code through YAML-based recipes. A recipe defines a complex training configuration in a structured, readable format, so adjusting it rarely requires code changes. Centralizing settings in a YAML recipe speeds up experimentation and makes it easy to replicate or modify configurations across different models and tasks.
The representative features are as follows:
- Easy Model Tuning: torchtune is a PyTorch-native library that simplifies SLM fine-tuning, making it accessible to users without advanced AI expertise.
- Easy Application of Distributed Training: torchtune simplifies the setup for distributed training, allowing users to scale their models across multiple GPUs with minimal configuration. This significantly reduces trial and error.
- Simplified Model Evaluation and Quantization: torchtune makes model evaluation and quantization straightforward, providing built-in support to easily assess model performance and optimize models for deployment.
- Scalability and Portability: torchtune is flexible enough to be used on various cloud platforms and local environments. It can be easily integrated with AzureML.
For more information about torchtune, please check this link.
1.2. Azure ML with torchtune
Running torchtune on AzureML offers several advantages that streamline the GenAI workflow. Here are some key benefits of using AzureML with torchtune:
- Scalability and Compute Power: Azure ML provides powerful, scalable compute resources, allowing torchtune to handle multiple SLMs/LLMs across multiple GPUs or distributed clusters. This makes it ideal for efficiently managing intensive tasks like fine-tuning and quantization on large datasets.
- Managed ML Environment: Azure ML offers a fully managed environment, so setting up dependencies and managing versions are handled with ease. This reduces setup time for torchtune, letting users focus directly on model optimization without infrastructure concerns.
- Model Deployment and Scaling: Once the model is optimized with torchtune, AzureML provides a straightforward pathway to deploy it on Azure’s cloud infrastructure, making it easy to scale applications to production with robust monitoring and scaling features.
- Seamless Integration with Other Azure Services: Users can leverage other Azure services, such as Azure Blob Storage for dataset storage or Azure SQL for data management. This ecosystem support enhances workflow efficiency and makes AzureML a powerful choice for torchtune-based model tuning and deployment.
2. torchtune YAML configuration
In a torchtune YAML configuration, each parameter and setting controls specific training aspects for fine-tuning large language models (LLMs). Here’s a breakdown of key components like supervised fine-tuning (SFT), direct preference optimization (DPO), knowledge distillation (KD), and quantization:
- SFT (Supervised Fine-Tuning): This setting manages the fine-tuning process by training the model with labeled datasets. It involves specifying the dataset path, batch size, learning rate, and the number of epochs. SFT is critical for adapting pre-trained models to specific tasks using supervised data.
- DPO (Direct Preference Optimization): This setting is for training models on human preference data. Unlike RLHF pipelines that first train a separate reward model, DPO optimizes the model directly on pairs of preferred (chosen) and rejected responses, using a frozen reference model to keep the policy from drifting. In torchtune, you can apply DPO through a YAML recipe.
- KD (Knowledge Distillation): In this setting, a larger, more accurate model (teacher) transfers knowledge to a smaller model (student). YAML settings might define teacher and student model paths, temperature (for smoothing probabilities), and alpha (weight for balancing loss between teacher predictions and labels). KD allows smaller models to mimic larger models’ performance while reducing computation needs. In torchtune, you can apply KD through a YAML recipe.
- Evaluation: torchtune integrates with EleutherAI’s LM Evaluation Harness, which allows you to evaluate the truthfulness and accuracy of your models using benchmarks like TruthfulQA. You can run these evaluations using torchtune’s eleuther_eval recipe.
- Quantization: This setting reduces model size and computational requirements by lowering the bit precision of model weights. YAML settings specify the quantization method (e.g., 8-bit or 4-bit), target layers, and possibly additional parameters for post-training quantization. This is particularly helpful for deploying models on edge devices with limited resources. In torchtune, you can apply quantization through a YAML recipe.
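As a rough illustration of why lower bit precision matters, the sketch below estimates weight storage only. It ignores activations, the KV cache, and quantization metadata such as scales. The helper name `weight_size_gb` is hypothetical, and 3.8 billion is the commonly cited parameter count for Phi-3-mini, used here purely as an example.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare 16-bit, 8-bit, and 4-bit weights for a 3.8B-parameter model.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_size_gb(3.8e9, bits):.1f} GB")
```

Halving the bit width halves the weight footprint, which is the main reason 8-bit and 4-bit variants fit on smaller devices.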
Check out the YAML samples on torchtune’s official website.
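To make the DPO objective mentioned above more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is a generic textbook formulation, not torchtune's implementation. The function name `dpo_loss` and the default `beta=0.1` are assumptions for illustration. The inputs are summed log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return -log(sigmoid(beta * margin)) for one preference pair."""
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference has been learned yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.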
3. Azure ML Training Life Hacks
Applying torchtune’s standalone command to Azure ML is very simple. However, chaining Hugging Face model download, fine-tuning, evaluation, and quantization into a single pipeline with distributed training, as described in the architecture, requires some trial and error. Refer to the life hacks below to minimize that trial and error when applying them to your workload.
3.1. Downloading model
The torch_distributed_zero_first context manager lets one process (local rank 0 in a distributed setup) perform certain operations first, such as downloading or loading a model, while the other processes wait. This approach is crucial in a distributed environment where multiple processes might attempt to load a model concurrently, which could lead to redundant downloads, excessive memory usage, or conflicts.
Here’s why torch_distributed_zero_first is used to download the model on a single process:
- Prevent Redundant Downloads: In a distributed setup, if every process tries to download the model simultaneously, it can lead to unnecessary network traffic and redundant file storage. By ensuring that only one process downloads the model, torch_distributed_zero_first prevents this redundancy.
- Avoid Conflicts and File Corruption: If multiple processes attempt to write or modify the same file during download, it could lead to file corruption or access conflicts. torch_distributed_zero_first minimizes this risk by allowing only one process to handle the file download.
After downloading, the model can be distributed or loaded into memory across all processes using standard PyTorch distributed training methods. This approach makes the model loading process more efficient and stable in multi-process environments.
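The ordering guarantee can be tried without GPUs. The sketch below is a simulation only: it replaces `torch.distributed` with `threading.Barrier`, and the names `zero_first` and `simulate` are hypothetical. It shows rank 0 running the guarded block first while the remaining ranks wait and then find the result already cached.

```python
import threading
from contextlib import contextmanager


@contextmanager
def zero_first(rank: int, barrier: threading.Barrier):
    """Let rank 0 run the body first; other ranks run it afterwards."""
    if rank != 0:
        barrier.wait()  # blocks until rank 0 has finished its body
    yield
    if rank == 0:
        barrier.wait()  # rank 0 arrives last and releases the others


def simulate(world_size: int = 4):
    barrier = threading.Barrier(world_size)
    lock = threading.Lock()
    cache, events = {}, []

    def worker(rank: int):
        with zero_first(rank, barrier):
            with lock:
                if "model" in cache:
                    events.append((rank, "reused cache"))
                else:
                    cache["model"] = "weights"  # stands in for the download
                    events.append((rank, "downloaded"))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = simulate()
print(events)
```

Only one "downloaded" event is recorded, and it always belongs to rank 0, because the other ranks cannot pass the barrier until rank 0 reaches it.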
3.2. Destroying process group
When applying distributed training on Azure ML with torchtune’s CLI, it’s essential to manage the process groups carefully. The distributed training recipe in the torchtune CLI initializes a process group using dist.init_process_group(...). However, if a process group is already active, initializing another one can cause conflicts, leading to nested or redundant process groups.
To prevent this, you should close any existing process groups before torchtune’s distributed training starts. This can be done by calling dist.destroy_process_group(...) to terminate any active process groups, ensuring a clean state. By doing so, you avoid process conflicts, enabling the torchtune CLI’s distributed training recipe to operate smoothly without overlapping with pre-existing groups. Code snippets for 3.1 and 3.2 are below.
import os
from contextlib import contextmanager

import torch
import torch.distributed as dist

MASTER_ADDR = os.environ.get('MASTER_ADDR', '127.0.0.1')
MASTER_PORT = os.environ.get('MASTER_PORT', '7777')
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))
GLOBAL_RANK = int(os.environ.get('RANK', -1))
LOCAL_RANK = int(os.environ.get('LOCAL_RANK', -1))

NUM_GPUS_PER_NODE = torch.cuda.device_count()
NUM_NODES = WORLD_SIZE // NUM_GPUS_PER_NODE

# Initialize a process group only when launched in distributed mode.
if LOCAL_RANK != -1:
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")


@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """
    Context manager to make all processes in distributed training
    wait for each local_master to do something.
    """
    if local_rank not in [-1, 0]:
        dist.barrier(device_ids=[local_rank])
    yield
    if local_rank == 0:
        dist.barrier(device_ids=[0])

...
with torch_distributed_zero_first(LOCAL_RANK):
    # Download the model
    download_model(args.teacher_model_id, args.teacher_model_dir)
    download_model(args.student_model_id, args.student_model_dir)

# Construct the fine-tuning command
if "single" in args.tune_recipe:
    print("***** Single Device Training *****")
    full_command = (
        f'tune run '
        f'{args.tune_recipe} '
        f'--config {args.tune_config_name}'
    )
    # Run the fine-tuning command
    run_command(full_command)
else:
    print("***** Distributed Training *****")
    # Close the existing group so that `tune run` can create its own.
    if dist.is_initialized():
        dist.destroy_process_group()
    if GLOBAL_RANK in {-1, 0}:
        # Run the fine-tuning command
        full_command = (
            f'tune run --master-addr {MASTER_ADDR} --master-port {MASTER_PORT} --nnodes {NUM_NODES} --nproc_per_node {NUM_GPUS_PER_NODE} '
            f'{args.tune_recipe} '
            f'--config {args.tune_config_name}'
        )
        run_command(full_command)
...
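One optional refinement is to assemble the `tune run` command as an argument list rather than a formatted string, which also makes it easy to guard the node-count division against a GPU count of zero. The sketch below is an assumption-laden illustration: `build_tune_command` is a hypothetical helper, and it uses only the flags that appear in the snippet above.

```python
import shlex


def build_tune_command(recipe, config, master_addr="127.0.0.1", master_port="7777",
                       world_size=1, gpus_per_node=1):
    """Return the `tune run` invocation as a list of arguments."""
    if "single" in recipe:
        return ["tune", "run", recipe, "--config", config]
    gpus_per_node = max(gpus_per_node, 1)         # avoid dividing by zero on CPU-only hosts
    nnodes = max(world_size // gpus_per_node, 1)  # always launch at least one node
    return [
        "tune", "run",
        "--master-addr", master_addr,
        "--master-port", str(master_port),
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(gpus_per_node),
        recipe, "--config", config,
    ]


cmd = build_tune_command("lora_finetune_distributed", "lora_finetune.yaml",
                         world_size=4, gpus_per_node=2)
print(shlex.join(cmd))  # shell-safe string form, e.g. for logging
```

A list can be passed directly to `subprocess.run` without shell quoting concerns.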
3.3. Dynamic configuration
Since the path to the blob storage mounted on the computing cluster is dynamic, the YAML recipe must be modified dynamically. Here’s an example of how to adjust the configuration using Jinja templates to ensure the paths are set correctly at runtime:
# Dynamically modify fine-tuning YAML file.
import os
from pathlib import Path

import jinja2

jinja_env = jinja2.Environment()
config_path = Path(args.tune_config_name)
template = jinja_env.from_string(config_path.read_text())
train_path = os.path.join(args.train_dir, "train.jsonl")

# Log to Weights & Biases only when an API key is provided.
metric_logger = "DiskLogger"
if len(args.wandb_api_key) > 0:
    metric_logger = "WandBLogger"

# Overwrite the recipe in place with the rendered values.
config_path.write_text(
    template.render(
        train_path=train_path,
        log_dir=args.log_dir,
        model_dir=args.model_dir,
        model_output_dir=args.model_output_dir,
        metric_logger=metric_logger
    )
)
lora_finetune.yaml code snippet
# Model arguments
model:
  ...

# Tokenizer
tokenizer:
  _component_: torchtune.models.phi3.phi3_mini_tokenizer
  path: {{model_dir}}/tokenizer.model
  max_seq_len: null

# Checkpointer
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: {{model_dir}}
  checkpoint_files: [
    model-00001-of-00002.safetensors,
    model-00002-of-00002.safetensors
  ]
  recipe_checkpoint: null
  output_dir: {{model_output_dir}}
  model_type: PHI3_MINI
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: {{train_path}}
  column_map:
    input: instruction
    output: output
  train_on_input: False
  packed: False
  split: train
seed: null
shuffle: True

# Logging
output_dir: {{log_dir}}/lora_finetune_output
metric_logger:
  _component_: torchtune.training.metric_logging.{{metric_logger}}
  log_dir: {{log_dir}}/training_logs
log_every_n_steps: 1
log_peak_memory_stats: False
...
In this setup:
- The script reads the template YAML file and dynamically injects the appropriate paths and configurations.
- train_path, log_dir, model_dir, and model_output_dir are populated based on the environment’s dynamically assigned paths, ensuring that the YAML file reflects the actual storage locations.
- metric_logger is set to "DiskLogger" by default but changes to "WandBLogger" if a wandb_api_key is provided, allowing for flexible metric logging configurations.
This approach guarantees that the configuration is always in sync with the environment, even when paths are assigned dynamically by Azure ML’s blob storage mounting.
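One thing worth checking before rendering is that every placeholder in the recipe has a value, since Jinja's default behavior is to render an undefined variable as an empty string. The following is a minimal stdlib sketch under a stated assumption: it only recognizes plain `{{name}}` placeholders, not filters or expressions. `missing_template_vars` is a hypothetical helper, not part of torchtune or Jinja.

```python
import re

# Matches simple placeholders such as {{model_dir}} or {{ log_dir }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_template_vars(template_text: str, values: dict) -> set:
    """Return placeholder names used in the template but absent from values."""
    return set(PLACEHOLDER.findall(template_text)) - set(values)


template_text = (
    "path: {{model_dir}}/tokenizer.model\n"
    "output_dir: {{ log_dir }}/lora_finetune_output\n"
)
print(missing_template_vars(template_text, {"model_dir": "/mnt/model"}))
```

If the returned set is not empty, the script can stop early instead of writing a recipe with blank paths.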
3.4. Logging
When running a training pipeline with the torchtune CLI, it may be challenging to use MLflow for logging. Therefore, you should use torchtune’s DiskLogger or WandBLogger instead.
The DiskLogger option logs metrics and training information directly to disk, making it a suitable choice when MLflow is unavailable. Alternatively, if you have a Weights & Biases (WandB) account and API key, the WandBLogger can be used to log metrics to your WandB dashboard, enabling remote access and visualization of training progress. This way, you can ensure robust logging and monitoring within the torchtune framework.
4. Azure ML Training
Before reading this section please refer to the Azure guide and past blogs (Blog 1, Blog 2) for basic information on Azure ML training and serving.
4.1. Dataset preparation
torchtune provides several dataset options, but in this blog, we will show how to save a Hugging Face dataset as JSON and register it as a Data asset in the Azure Blob Datastore. If you would like to build or augment your own dataset, please refer to the blog and the GitHub repo for synthetic data generation.
Instruction Dataset for SFT and KD
Preprocessing the dataset is not difficult, but don’t forget to convert the column names to match the specifications in the yaml file.
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/helpful_instructions", name="self_instruct", split="train[:10%]")
# Rename the columns to match the column_map in the YAML recipe.
dataset = dataset.rename_column('prompt', 'instruction')
dataset = dataset.rename_column('completion', 'output')

print(f"Loaded Dataset size: {len(dataset)}")

if IS_DEBUG:
    logger.info("Activated Debug mode. The number of samples was reduced to 800.")
    dataset = dataset.select(range(800))
    print(f"Debug Dataset size: {len(dataset)}")

logger.info(f"Save dataset to {SFT_DATA_DIR}")
dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
train_dataset.to_json(f"{SFT_DATA_DIR}/train.jsonl", force_ascii=False)
test_dataset = dataset['test']
test_dataset.to_json(f"{SFT_DATA_DIR}/eval.jsonl", force_ascii=False)
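Because a mismatched column name only surfaces once training starts, a quick check of the exported file can save a failed job. The sketch below assumes the recipe maps `instruction` and `output`, as in the snippet above. `rows_missing_columns` is a hypothetical helper that takes the file contents as a string.

```python
import json


def rows_missing_columns(jsonl_text, required=("instruction", "output")):
    """Return 1-based line numbers of records that lack a required column."""
    bad = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if any(col not in record for col in required):
            bad.append(lineno)
    return bad


sample = "\n".join([
    json.dumps({"instruction": "Say hi", "output": "Hi!"}),
    json.dumps({"prompt": "Say bye", "completion": "Bye!"}),  # columns not renamed
])
print(rows_missing_columns(sample))
```

In practice the text would come from reading the exported train.jsonl file.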
Preference Dataset for DPO
For the preference dataset, it may be necessary to convert it into a chat template format. Below is a code example.
import json

from datasets import load_dataset


def convert_to_preference_format(dataset):
    # Pair each prompt with its chosen and rejected answer as two short chats.
    json_format = [
        {
            "chosen_conversations": [
                {"content": row["prompt"], "role": "user"},
                {"content": row["chosen"], "role": "assistant"}
            ],
            "rejected_conversations": [
                {"content": row["prompt"], "role": "user"},
                {"content": row["rejected"], "role": "assistant"}
            ]
        }
        for row in dataset
    ]
    return json_format


# Load dataset from the hub
data_path = "jondurbin/truthy-dpo-v0.1"
dataset = load_dataset(data_path, split="train")

print(f"Dataset size: {len(dataset)}")
# if IS_DEBUG:
#     logger.info(f"Activated Debug mode. The number of sample was resampled to 1000.")
#     dataset = dataset.select(range(800))

logger.info(f"Save dataset to {DPO_DATA_DIR}")
dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
test_dataset = dataset['test']

train_dataset = convert_to_preference_format(train_dataset)
test_dataset = convert_to_preference_format(test_dataset)

with open(f"{DPO_DATA_DIR}/train.jsonl", "w") as f:
    json.dump(train_dataset, f, ensure_ascii=False, indent=4)

with open(f"{DPO_DATA_DIR}/eval.jsonl", "w") as f:
    json.dump(test_dataset, f, ensure_ascii=False, indent=4)
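Note that `json.dump` on a list writes a single JSON array, even though the file is named `.jsonl`. If a downstream tool expects strict JSON Lines (one object per line), a writer along the following lines could be used instead. This is a sketch, not the original author's method, and `to_json_lines` is a hypothetical helper.

```python
import json


def to_json_lines(records) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


records = [
    {
        "chosen_conversations": [
            {"content": "Is the sky green?", "role": "user"},
            {"content": "No, it usually looks blue.", "role": "assistant"},
        ],
        "rejected_conversations": [
            {"content": "Is the sky green?", "role": "user"},
            {"content": "Yes.", "role": "assistant"},
        ],
    },
    {"chosen_conversations": [], "rejected_conversations": []},
]
text = to_json_lines(records)
print(len(text.splitlines()))
```

Each line can then be parsed independently with `json.loads`, which is what line-oriented readers rely on.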
4.2. Environment asset
You can add pip install to the command based on the curated environment or add a conda-based custom environment, but in this blog, we will add a docker-based custom environment.
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2

# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

# Inference requirements
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888

# Support the DeepSpeed launcher requirement of passwordless ssh login
RUN apt-get update
RUN apt-get install -y openssh-server openssh-client

RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation
[Tip] If you are building a container with Ubuntu 22.04, make sure to remove the liblttng-ust0 related packages/dependencies. Otherwise, you will get an error when building the container.
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2204-cu124-py310-torch250:biweekly.202410.2
...
# Remove packages or dependencies related to liblttng-ust0.
# Starting from Ubuntu 22.04, liblttng-ust0 has been updated to liblttng-ust1 package, deprecating liblttng-ust0 for compatibility reasons.
# If you build a docker file on Ubuntu 22.04 without including this syntax, you will get the following liblttng-ust0 error:
# -- Package 'liblttng-ust0' has no installation candidate
RUN sed -i '/liblttng-ust0/d' /var/requirements/system_requirements.txt
...
4.3. Start a Training job
The code snippet below activates a compute cluster for training. The command function allows the user to configure the following key aspects.
- inputs – A dictionary of inputs passed to the command as name-value pairs. To use files or folders as inputs, use the Input class, which supports three parameters:
  - type – The type of input. This can be a uri_file or uri_folder. The default is uri_folder.
  - path – The path to the file or folder. These can be local or remote files or folders. For remote files, http/https and wasb are supported.
    - Azure ML data/dataset or datastore are of type uri_folder. To use data/dataset as input, you can use a registered dataset in the workspace using the format '<data_name>:' followed by the version, e.g. Input(type='uri_folder', path='my_dataset:1').
  - mode – How the data should be delivered to the compute target. Allowed values are ro_mount, rw_mount, and download. The default is ro_mount.
- code – The path where the code to run the command is located.
- compute – The compute on which the command will run. You can run it on the local machine by using local for the compute.
- command – The command that needs to be run. Inputs are referenced in the command using the ${{inputs.<input_name>}} expression.
- environment – The environment needed for the command to run. Curated (built-in) or custom environments from the workspace can be used.
- instance_count – Number of nodes. The default is 1.
- distribution – Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training.
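To see how the `${{inputs.<input_name>}}` expressions relate to the inputs dictionary, the toy resolver below substitutes each placeholder with its value and fails when the command references an input that was never defined. It is an illustration only and not how Azure ML resolves expressions internally. `resolve_command` is a hypothetical helper.

```python
import re

# Matches expressions such as ${{inputs.log_dir}}.
INPUT_REF = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def resolve_command(command: str, inputs: dict) -> str:
    """Replace every ${{inputs.name}} in command with inputs[name]."""
    missing = sorted(set(INPUT_REF.findall(command)) - set(inputs))
    if missing:
        raise KeyError(f"command references undefined inputs: {missing}")
    return INPUT_REF.sub(lambda m: str(inputs[m.group(1)]), command)


resolved = resolve_command(
    "python launcher_single.py --log_dir ${{inputs.log_dir}}",
    {"log_dir": "./outputs/log"},
)
print(resolved)
```

Running a check like this locally before submitting can catch a typo in an input name without waiting for the job to be scheduled.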
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration
from utils.aml_common import get_num_gpus

num_gpu = get_num_gpus(azure_compute_cluster_size)
logger.info(f"Number of GPUs={num_gpu}")

str_command = ""
if USE_BUILTIN_ENV:
    str_env = "azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/19"  # Use built-in Environment asset
    str_command += "pip install -r requirements.txt && "
else:
    str_env = f"{azure_env_name}@latest"  # Use custom Environment asset

# Pick the recipe and launcher that match the number of GPUs.
if num_gpu > 1:
    tune_recipe = "lora_finetune_distributed"
    str_command += "python launcher_distributed.py "
else:
    tune_recipe = "lora_finetune_single_device"
    str_command += "python launcher_single.py "

# Pass the WandB arguments only when an API key is actually set.
if wandb_api_key:
    str_command += (
        "--wandb_api_key ${{inputs.wandb_api_key}} "
        "--wandb_project ${{inputs.wandb_project}} "
        "--wandb_watch ${{inputs.wandb_watch}} "
    )

str_command += (
    "--train_dir ${{inputs.train_dir}} "
    "--hf_token ${{inputs.hf_token}} "
    "--tune_recipe ${{inputs.tune_recipe}} "
    "--tune_action ${{inputs.tune_action}} "
    "--model_id ${{inputs.model_id}} "
    "--model_dir ${{inputs.model_dir}} "
    "--log_dir ${{inputs.log_dir}} "
    "--model_output_dir ${{inputs.model_output_dir}} "
    "--tune_config_name ${{inputs.tune_config_name}}"
)
logger.info(f"Tune recipe: {tune_recipe}")

job = command(
    inputs=dict(
        #train_dir=Input(type="uri_folder", path=SFT_DATA_DIR), # Get data from local path
        train_dir=Input(path=f"{AZURE_SFT_DATA_NAME}@latest"),  # Get data from Data asset
        hf_token=HF_TOKEN,
        wandb_api_key=wandb_api_key,
        wandb_project=wandb_project,
        wandb_watch=wandb_watch,
        tune_recipe=tune_recipe,
        tune_action="fine-tune,run-quant",
        model_id=HF_MODEL_NAME_OR_PATH,
        model_dir="./model",
        log_dir="./outputs/log",
        model_output_dir="./outputs",
        tune_config_name="lora_finetune.yaml"
    ),
    code="./scripts",  # local path where the code is stored
    compute=azure_compute_cluster_name,
    command=str_command,
    environment=str_env,
    instance_count=1,
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": num_gpu,  # For multi-gpu training set this to an integer value more than 1
    },
)

returned_job = ml_client.jobs.create_or_update(job)
logger.info("""Started training job. Now a dedicated Compute Cluster for training is provisioned and the environment
required for training is automatically set up from Environment.

If you have set up a new custom Environment, it will take approximately 20 minutes or more to set up the Environment before provisioning the training cluster.
""")
ml_client.jobs.stream(returned_job.name)
4.4. Logging
Use torchtune.training.metric_logging.DiskLogger or torchtune.training.metric_logging.WandBLogger. When applying DiskLogger, the save path must be a subfolder of outputs. Otherwise, you cannot check it in the Azure ML UI.
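A small guard can catch a misplaced log directory before the job is submitted. The sketch below assumes the path is given relative to the job's working directory, as with `./outputs/log` in the training job above. `is_under_outputs` is a hypothetical helper.

```python
import posixpath


def is_under_outputs(path: str) -> bool:
    """True when a relative path resolves to a location below ./outputs."""
    normalized = posixpath.normpath(path)  # collapses "./" and "../" segments
    return normalized.startswith("outputs/")


for candidate in ("./outputs/log", "./logs", "outputs/../logs"):
    print(candidate, is_under_outputs(candidate))
```

Normalizing first means a path that escapes the folder through `..` is rejected even though it starts with `outputs`.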
Below is a screenshot of DiskLogger applied.
Below is a screenshot of WandBLogger applied.
Any additional training history is recorded in the user_logs folder of Azure ML. Below is an example when using Standard_NC48ads_A100_v4 (2 x NVIDIA A100 GPUs) as a compute cluster.
Do not forget to save the quantized model parameters when you apply the fine-tuning, evaluation, and quantization pipeline in your training code. It is recommended that you also save the original model weights before quantization for comparison.
4.5. Registering a Model
Once you have fine-tuned and quantized your model using torchtune, you can register it as a Model asset on Azure ML. This registration process offers several advantages, making model management and deployment more efficient and organized. Here are the advantages of Registering as a Model asset.
- Version Control: Azure ML’s Model asset allows you to maintain multiple versions of a model. Each new iteration of your model, whether it’s a different fine-tuning configuration or an updated quantization approach, can be registered as a new version. This makes it easy to track model evolution, compare performance across versions, and roll back to previous versions if necessary.
- Centralized Repository: By registering your model as an asset, you store it in a centralized repository. This repository provides easy access for other team members or projects within your organization, enabling collaboration and consistent model usage across different applications.
- Deployment Ready: Models registered as assets in AzureML are directly deployable. This means you can set up endpoints, batch inference pipelines, or other serving mechanisms using the registered model, streamlining the deployment process and minimizing potential errors.
- Metadata Management: Along with the model, you can also store relevant metadata (such as training configuration, environment details, and evaluation metrics) in the Model asset. This metadata is essential for reproducibility and for understanding model performance under different conditions.
Below is a code snippet that registers a model asset and downloads the model artifact.
import os

from azure.ai.ml.entities import Model
from azure.core.exceptions import ResourceExistsError, ResourceNotFoundError


def get_or_create_model_asset(ml_client, model_name, job_name, model_dir="outputs", model_type="custom_model",
                              download_quantized_model_only=False, update=False):

    try:
        # max() raises ValueError when no version of the model is registered yet.
        latest_model_version = max([int(m.version) for m in ml_client.models.list(name=model_name)])
        if update:
            raise ResourceExistsError('Found Model asset, but will update the Model.')
        else:
            model_asset = ml_client.models.get(name=model_name, version=str(latest_model_version))
            print(f"Found Model asset: {model_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError, ValueError) as e:
        print(f"Exception: {e}")
        # Register the job output (optionally only the quantized model) as a new version.
        model_path = f"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}"
        if download_quantized_model_only:
            model_path = f"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/quant"
        run_model = Model(
            name=model_name,
            path=model_path,
            description="Model created from run.",
            type=model_type  # mlflow_model, custom_model, triton_model
        )
        model_asset = ml_client.models.create_or_update(run_model)
        print(f"Created Model asset: {model_name}")

    return model_asset


model = get_or_create_model_asset(ml_client, azure_model_name, job_name, model_dir, model_type="custom_model",
                                  download_quantized_model_only=True, update=False)

# Download the model (this is optional)
DOWNLOAD_TO_LOCAL = False
local_model_dir = "./artifact_downloads_dpo"

if DOWNLOAD_TO_LOCAL:
    os.makedirs(local_model_dir, exist_ok=True)
    ml_client.models.download(name=azure_model_name, download_path=local_model_dir, version=model.version)
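The snippet above converts each version to an integer before taking the maximum. The sketch below shows why that matters, assuming versions are numeric strings such as "1", "2", "10". `latest_version` is a hypothetical helper that also handles the case where nothing is registered yet.

```python
def latest_version(versions):
    """Return the highest numeric version as a string, or None if there is none."""
    numeric = [int(v) for v in versions]
    return str(max(numeric)) if numeric else None


versions = ["1", "2", "10"]
print(latest_version(versions))  # numeric comparison
print(max(versions))             # plain string comparison picks a different value
print(latest_version([]))        # nothing registered yet
```

Comparing the strings directly orders them character by character, so "2" sorts after "10".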
The end-to-end code for this post is published at https://github.com/Azure/torchtune-azureml. We hope it helps you perform fine-tuning, evaluation, and quantization with torchtune on Azure ML.
References
- Azure ML Fine-tuning (Florence-2) Blog
- Synthetic QnA Generation Blog
- torchtune official website
- Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker