The Future of AI: Fine-Tuning Llama 3.1 8B on Azure AI Serverless, why it’s so easy & cost efficient
The Future of AI: LLM Distillation just got easier
Part 2 – Fine-Tuning Llama 3.1 8B on Azure AI Serverless
How Azure AI Serverless Fine-tuning, LoRA, RAFT and the AI Python SDK are streamlining fine-tuning of domain-specific models. (🚀🔥 GitHub recipe repo).
By Cedric Vidal, Principal AI Advocate, Microsoft
Part of the Future of AI 🚀 series initiated by Marco Casalaina with his Exploring Multi-Agent AI Systems blog post.
AI-powered engine fine-tuning setup, generated using Azure OpenAI DALL-E 3
In our previous blog post, we explored utilizing Llama 3.1 405B with RAFT to generate a synthetic dataset. Today, you’ll learn how to fine-tune a Llama 3.1 8B model with the dataset you generated. This post will walk you through a simplified fine-tuning process using Azure AI Fine-Tuning as a Service, highlighting its ease of use and cost efficiency. We’ll also explain what LoRA is and why combining RAFT with LoRA provides a unique advantage for efficient and affordable model customization. Finally, we’ll provide practical, step-by-step code examples to help you apply these concepts in your own projects.
The concepts and source code mentioned in this post are fully available in the GitHub recipe repo.
Azure AI takes the complexity out of the equation. Gone are the days when setting up GPU infrastructure, configuring Python frameworks, and mastering model fine-tuning techniques were necessary hurdles. Azure Serverless Fine-Tuning allows you to bypass the hassle entirely. Simply upload your dataset, adjust a few hyperparameters, and start the fine-tuning process. This ease of use democratizes AI development, making it accessible to a wider range of users and organizations.
Why Azure AI Serverless Fine-Tuning Changes the Game
Fine-tuning a model used to be a daunting task:
Skill Requirements: Proficiency in Python and machine learning frameworks like TensorFlow or PyTorch was essential.
Resource Intensive: Setting up and managing GPU infrastructure required significant investment.
Time-Consuming: The process was often lengthy, from setup to execution.
Azure AI Fine-Tuning as a Service eliminates these barriers by providing an intuitive platform where you can fine-tune models without worrying about the underlying infrastructure. With serverless capabilities, you simply upload your dataset, specify hyperparameters, and hit the “fine-tune” button. This streamlined process allows for quick iterations and experimentation, significantly accelerating AI development cycles.
Llama relaxing in a workshop, generated using Azure OpenAI DALL-E 3
LoRA: A Game-Changer for Efficient Fine-Tuning
What is LoRA?
LoRA (Low-Rank Adaptation) is an efficient method for fine-tuning large language models. Unlike traditional fine-tuning, which updates all the model’s weights, LoRA keeps the original weights frozen and trains only a small set of additional low-rank weights, stored in an adapter. This focused approach drastically reduces the time and cost needed for fine-tuning while largely preserving the model’s performance.
LoRA in Action
LoRA fine-tunes models by training a small adapter instead of the full set of weights, offering several advantages:
Selective Weight Updating: Only the adapter weights, a small fraction of the model’s size, are trained, reducing computational requirements.
Cost Efficiency: Lower computational demands translate to reduced operational costs.
Speed: Fine-tuning is faster, enabling quicker deployments and iterations.
Illustration of LoRA Fine-tuning. This diagram shows a single attention block enhanced with LoRA. Each attention block in the model typically incorporates its own LoRA module. SVG diagram generated using Azure OpenAI GPT-4o
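To make the adapter idea concrete, here is a minimal pure-Python sketch of a single linear layer augmented with LoRA. It is illustrative only: the toy dimensions, the rank `r`, the scaling factor `alpha`, and the function names are assumptions chosen for readability, not the configuration used by Azure AI Serverless Fine-Tuning.

```python
import random


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x): frozen base output plus low-rank update."""
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable adapter path, squeezed through rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4  # hypothetical toy sizes

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen, d_out x d_in
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, starts at zero

x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialized to zero, the adapter is a no-op before training starts.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_out * d_in        # weights updated by full fine-tuning of this layer
lora_params = r * (d_in + d_out)  # weights updated by LoRA for this layer
print(f"full fine-tuning trains {full_params} weights, LoRA trains {lora_params}")
```

With these toy sizes the saving is modest (28 weights instead of 48), but because the adapter grows with `r * (d_in + d_out)` rather than `d_in * d_out`, the gap widens quickly as the layer dimensions grow.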
Combining RAFT and LoRA: Why It’s So Effective
We’ve seen that Serverless Fine-tuning on Azure AI uses LoRA, which updates only a fraction of the model’s weights, and that this is what makes it cheap and fast.
With the combination of RAFT and LoRA, the model is not taught new fundamental knowledge. Instead, it becomes an expert at understanding the domain, focusing its attention on the citations that are the most useful to answer a question, but it doesn’t contain all the information about the domain. It is like a librarian (see RAG Hack session on RAFT): a librarian doesn’t know the content of all the books perfectly, but knows which books contain the answers to a given question.
Another way to look at it is from the standpoint of information theory. Because LoRA only updates a fraction of the weights, there is only so much information you can store in those weights, as opposed to full weight fine-tuning, which updates all the weights of the model from bottom to top.
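As a rough back-of-the-envelope illustration of that capacity gap, the sketch below counts the trainable weights of a hypothetical LoRA adapter against the full model. The layer shapes follow the published Llama 3.1 8B configuration and the total parameter count is approximate; the rank of 16 and the choice of adapting only the query and value projections are assumptions made for the example, since the settings used by the Azure service are not stated here.

```python
# Llama 3.1 8B shapes (published model configuration).
HIDDEN = 4096                 # hidden size
KV_DIM = 1024                 # 8 key/value heads x 128 dims (grouped-query attention)
LAYERS = 32                   # number of transformer blocks
TOTAL_PARAMS = 8_030_000_000  # approximate total parameter count

RANK = 16  # assumption for illustration


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair A (rank x d_in) and B (d_out x rank) adds rank * (d_in + d_out) weights."""
    return rank * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # query projection (assumed target)
    + lora_params(HIDDEN, KV_DIM, RANK)  # value projection (assumed target)
)
adapter_total = per_layer * LAYERS
fraction = adapter_total / TOTAL_PARAMS

print(f"adapter weights: {adapter_total:,}")
print(f"share of full model: {fraction:.4%}")
```

Under these assumptions the adapter holds fewer than 0.1% as many weights as the full model, which is why it is better suited to teaching a skill, such as reading retrieved documents, than to memorizing a whole domain.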
LoRA might look like a limitation but it’s actually perfect when used in combination with RAFT and RAG. You get the best of RAG and fine-tuning. RAG provides access to a potentially infinite amount of reference documents and RAFT with LoRA provides a model which is an expert at understanding the documents retrieved by RAG at a fraction of the cost of full weight fine-tuning.
Azure AI Fine-Tuning API and the Importance of Automating your AI Ops Pipeline
Azure AI empowers developers with serverless fine-tuning via an API, simplifying the integration of fine-tuning processes into automated AI operations (AI Ops) pipelines. Organizations can use the Azure AI Python SDK to further streamline this process, enabling seamless orchestration of model training workflows. This includes systematic data handling, model versioning, and deployment. Automating these processes is crucial as it ensures consistency, reduces human error, and accelerates the entire AI lifecycle—from data preparation, through model training, to deployment and monitoring. By leveraging Azure AI’s serverless fine-tuning API, along with the Python SDK, organizations can maintain an efficient, scalable, and agile AI Ops pipeline, ultimately driving faster innovation and more reliable AI systems.
Addressing Model Drift and Foundation Model Obsolescence
One critical aspect of machine learning, especially in fine-tuning, is ensuring that models generalize well to unseen data. This is the primary purpose of the evaluation phase.
However, as domains evolve and documents are added or updated, models will inevitably begin to drift. The rate of this drift depends on how quickly your domain changes; it could be a month, six months, a year, or even longer.
Therefore, it’s essential to periodically refresh your model and execute the distillation process anew to maintain its performance.
Moreover, the field of AI is dynamic, with new and improved foundational models being released frequently. To leverage these advancements, you should have a streamlined process to re-run distillation on the latest models, enabling you to measure improvements and deploy updates to your users efficiently.
Why Automating the Distillation Process is Essential
Automation in the distillation process is crucial. As new documents are added or existing ones are updated, your model’s alignment with the domain can drift over time. Setting up an automated, end-to-end distillation pipeline ensures that your model remains current and accurate. By regularly re-running the distillation, you can keep the model aligned with the evolving domain, maintaining its reliability and performance.
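One simple way to trigger such a pipeline is to fingerprint the document corpus and re-run distillation only when the fingerprint changes. The sketch below is a minimal illustration of that idea, not part of the recipe: `corpus_fingerprint`, `needs_refresh`, and the state file are hypothetical names, and the actual distillation steps (synthetic data generation, fine-tuning, evaluation, deployment) are left to the caller.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def corpus_fingerprint(doc_dir: Path) -> str:
    """Hash the relative path and content of every file, in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in doc_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(doc_dir).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_refresh(doc_dir: Path, state_file: Path) -> bool:
    """Return True, and record the new fingerprint, if the corpus changed since the last run."""
    current = corpus_fingerprint(doc_dir)
    previous = None
    if state_file.exists():
        previous = json.loads(state_file.read_text())["fingerprint"]
    if current == previous:
        return False
    state_file.write_text(json.dumps({"fingerprint": current}))
    return True


# Demo on a temporary corpus; the state file lives outside the document folder.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    docs.mkdir()
    state = Path(tmp) / "state.json"

    (docs / "a.txt").write_text("version 1")
    first = needs_refresh(docs, state)   # no previous state: refresh
    second = needs_refresh(docs, state)  # nothing changed: skip
    (docs / "a.txt").write_text("version 2")
    third = needs_refresh(docs, state)   # document updated: refresh

print(first, second, third)  # True False True
```

A scheduled job could call `needs_refresh` and launch the distillation pipeline only when it returns `True`, which avoids paying for fine-tuning runs when the domain has not changed.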
Practical Steps: Fine-Tuning Llama 3.1 8B with RAFT and LoRA
Now that we’ve explained the benefits, let’s walk through the practical steps using the raft-distillation-recipe repository on GitHub.
If you have not yet run the synthetic data generation phase using RAFT, I invite you to head over to the previous article of this blog series.
Once you have your synthetic dataset on hand, you can head over to the finetuning notebook of the distillation recipe repository.
Here are the key snippets of code illustrating how to use the Azure AI Python SDK to upload a dataset, subscribe to the Marketplace offer, and create and submit a fine-tuning job on the Azure AI Serverless platform.
Uploading the training dataset
The following code checks if the training dataset already exists in the workspace and uploads it only if needed. It incorporates the hash of the dataset into the filename, facilitating easy detection of whether the file has been previously uploaded.
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.ai.ml.entities._inputs_outputs import Input

# The dataset hash is part of the asset name, so an unchanged file maps to an existing asset.
dataset_version = "1"
train_dataset_name = f"{ds_name}_train_{train_hash}"
try:
    # Reuse the data asset if a previous run already uploaded it.
    train_data_created = workspace_ml_client.data.get(train_dataset_name, version=dataset_version)
    print(f"Dataset {train_dataset_name} already exists")
except Exception:
    # The lookup failed, most likely because the asset does not exist yet: upload the file.
    print(f"Creating dataset {train_dataset_name}")
    train_data = Data(
        path=dataset_path_ft_train,
        type=AssetTypes.URI_FILE,
        description=f"{ds_name} training dataset",
        name=train_dataset_name,
        version=dataset_version,
    )
    train_data_created = workspace_ml_client.data.create_or_update(train_data)

# Reference the registered data asset as the training input of the fine-tuning job.
training_data = Input(
    type=train_data_created.type,
    path=f"azureml://locations/{workspace.location}/workspaces/{workspace._workspace_id}/data/{train_data_created.name}/versions/{train_data_created.version}",
)
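The snippet above relies on a `train_hash` value computed earlier in the notebook. As an illustration of the idea only, a content hash could be derived as follows; the helper name `file_hash`, the choice of SHA-256, the 8-character truncation, and the demo file contents are assumptions, not necessarily what the recipe does.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path, length: int = 8) -> str:
    """Return a short hex digest of the file content, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


# Demo: the hash is stable for identical content and changes when the file changes.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "train.jsonl"
    dataset.write_text("first version of the dataset\n")
    train_hash = file_hash(dataset)
    same_hash = file_hash(dataset)
    dataset.write_text("second version of the dataset\n")
    changed_hash = file_hash(dataset)

ds_name = "demo"  # hypothetical dataset name
train_dataset_name = f"{ds_name}_train_{train_hash}"
print(train_dataset_name)
print("unchanged file, same hash:", train_hash == same_hash)
print("modified file, new hash:", train_hash != changed_hash)
```

Because the hash is embedded in the asset name, a modified dataset gets a new name and is uploaded, while an unchanged one is found by the lookup and reused.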
Subscribing to the Marketplace offer
This step is only necessary when fine-tuning a model from a third-party vendor such as Meta or Mistral. If you’re fine-tuning a Microsoft first-party model such as Phi 3, you can skip this step.
from azure.ai.ml.entities import MarketplaceSubscription
from azure.core.exceptions import ResourceExistsError

# Drop the last two path segments of the model asset ID (the version suffix).
model_id = "/".join(foundation_model.id.split("/")[:-2])
# Derive the subscription name from the model name, replacing "." and "_" with "-".
subscription_name = model_id.split("/")[-1].replace(".", "-").replace("_", "-")
print(f"Subscribing to Marketplace model: {model_id}")

marketplace_subscription = MarketplaceSubscription(
    model_id=model_id,
    name=subscription_name,
)
try:
    marketplace_subscription = workspace_ml_client.marketplace_subscriptions.begin_create_or_update(marketplace_subscription).result()
except ResourceExistsError:
    # The subscription was created by a previous run; nothing to do.
    print(f"Marketplace subscription {subscription_name} already exists for model {model_id}")
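To see what the two string manipulations above produce, here is a small standalone example. The asset ID below is a made-up illustration of a registry model ID (the registry name, model name, and version number are assumptions); in the notebook the value comes from `foundation_model.id`.

```python
# Hypothetical asset ID, for illustration only.
foundation_model_id = (
    "azureml://registries/azureml-meta/models/Meta-Llama-3.1-8B-Instruct/versions/1"
)

# Same logic as in the snippet above: strip the trailing "versions/<n>" segments...
model_id = "/".join(foundation_model_id.split("/")[:-2])
# ...then turn the model name into a dash-separated subscription name.
subscription_name = model_id.split("/")[-1].replace(".", "-").replace("_", "-")

print(model_id)           # azureml://registries/azureml-meta/models/Meta-Llama-3.1-8B-Instruct
print(subscription_name)  # Meta-Llama-3-1-8B-Instruct
```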
Create the fine-tuning job using the model and data as inputs
# CustomModelFineTuningJob, Output, and the variables used here (task, validation_data,
# learning_rate, model_to_finetune, ...) are imported or defined earlier in the notebook.
finetuning_job = CustomModelFineTuningJob(
    task=task,
    training_data=training_data,
    validation_data=validation_data,
    # Hyperparameter values are passed as strings.
    hyperparameters={
        "per_device_train_batch_size": "1",
        "learning_rate": str(learning_rate),
        "num_train_epochs": "1",
        "registered_model_name": registered_model_name,
    },
    model=model_to_finetune,
    display_name=job_name,
    name=job_name,
    experiment_name=experiment_name,
    outputs={"registered_model": Output(type="mlflow_model", name=f"ft-job-finetune-registered-{short_guid}")},
)
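Note that every hyperparameter value in the job definition above is a string. If you generate job definitions programmatically, for example to compare several learning rates, a small helper can take care of that conversion. This is only a sketch: `build_hyperparameters` is a hypothetical helper, and the learning rates and model names are arbitrary examples.

```python
def build_hyperparameters(batch_size, learning_rate, epochs, registered_model_name):
    """Build the hyperparameters dict with every value converted to a string."""
    values = {
        "per_device_train_batch_size": batch_size,
        "learning_rate": learning_rate,
        "num_train_epochs": epochs,
        "registered_model_name": registered_model_name,
    }
    return {key: str(value) for key, value in values.items()}


# Hypothetical sweep over two learning rates, one job definition per value.
sweep = [
    build_hyperparameters(1, lr, 1, f"llama-ft-run-{index}")
    for index, lr in enumerate((2e-5, 1e-4))
]
for hyperparameters in sweep:
    print(hyperparameters)
```

Each resulting dictionary can be passed as the `hyperparameters` argument of a separate fine-tuning job.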
Submit the fine-tuning job
The following snippet will submit the previously created fine-tuning job to the Azure AI serverless platform. If the submission is successful, the job details including the Studio URL and the registered model name will be printed. Any errors encountered during the submission will be displayed as well.
try:
    print(f"Submitting job {finetuning_job.name}")
    created_job = workspace_ml_client.jobs.create_or_update(finetuning_job)
    print(f"Successfully created job {finetuning_job.name}")
    print(f"Studio URL is {created_job.studio_url}")
    print(f"Registered model name will be {registered_model_name}")
except Exception as e:
    # Display the error, then re-raise it so the failure is not silently ignored.
    print("Error creating job", e)
    raise
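Submitting the job returns as soon as it is created, so an automated pipeline typically has to wait for the job to reach a terminal state before moving on to evaluation or deployment. The sketch below shows a generic polling loop. `wait_for_job` and its parameters are hypothetical, and the terminal status names are assumptions based on the usual Azure ML job states; with the SDK, the `get_status` callable could plausibly be `lambda: workspace_ml_client.jobs.get(created_job.name).status`.

```python
import time
from typing import Callable

# Assumed terminal states; adjust to the statuses your platform actually reports.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 6 * 3600,
) -> str:
    """Poll get_status() until a terminal status is returned or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in status {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Simulated job that goes through a few states before completing.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.01)
print(final_status)  # Completed
```

Passing the status lookup as a callable keeps the loop independent of the SDK, which makes it easy to test without submitting a real job.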
The full runnable code is available in the previously mentioned finetuning notebook.
Join the Conversation
We invite you to join our tech community on Discord to discuss fine-tuning techniques, RAFT, LoRA, and more. Whether you’re a seasoned AI developer or just starting, our community is here to support you. Share your experiences, ask questions, and collaborate with fellow AI enthusiasts. Join us on Discord and be part of the conversation!
What’s next?
This concludes the second installment of our blog series on fine-tuning the Llama 3.1 8B model with RAFT and LoRA, harnessing the capabilities of Azure AI Serverless Fine-Tuning. Today, we’ve shown how these advanced technologies enable efficient and cost-effective model customization that precisely meets your domain needs.
By integrating RAFT and LoRA, you can transform your models into specialists that effectively navigate and interpret relevant information from extensive document repositories using RAG, all while significantly cutting down on the time and costs associated with full weight fine-tuning. This methodology accelerates the fine-tuning process and democratizes access to advanced AI capabilities.
With the detailed steps and code snippets provided, you now have the tools to implement serverless fine-tuning within your AI development workflow. Leveraging automation in AI Ops will help you maintain and optimize model performance over time, keeping your AI solutions competitive in an ever-changing environment.
Stay tuned! In two weeks, we’ll dive into the next topic: deploying our fine-tuned models.