How to Evaluate & Upgrade Model Versions in the Azure OpenAI Service
Introduction
As an Azure OpenAI customer, you have access to the most advanced artificial intelligence models powered by OpenAI. These models are constantly improving and evolving, which means that you can benefit from the latest innovations and enhancements, including improved speed, improved safety systems, and reduced costs. However, this also means that older model versions will eventually be deprecated and retired.
We notify customers of upcoming retirements well in advance, starting from model launch.
At model launch, we programmatically designate a “not sooner than” retirement date (typically six months to one year out).
For Generally Available (GA) models, we give customers with active deployments at least 60 days' notice before a model is retired.
For preview model versions, which should never be used in production applications, we provide at least 30 days' notice.
You can read about our process, who is notified, and details of upcoming model deprecations and retirements here: Azure OpenAI Service model retirements – Azure OpenAI | Microsoft Learn
Azure AI Studio Evaluations
We understand that upgrading model versions involves a challenging and time-consuming process of evaluation, especially if you have numerous prompts and responses to assess and certify your applications. You likely want to compare the prompt responses across different model versions to see how changes impact your use cases and outcomes.
Azure AI Studio Evaluations can help you evaluate the latest model versions in the Azure OpenAI service. Evaluations support both a code-first and UI-friendly experience, enabling you to compare prompt responses across different model versions and observe differences in quality, accuracy, and consistency. You can also use evaluations to test your prompts and applications with the new model versions at any point in your LLMOps lifecycle, making any necessary adjustments or optimizations.
A Code-First Approach to Evaluation
Azure’s Prompt Flow Evaluations SDK package is a powerful and flexible tool for evaluating responses from your generative AI application. In this blog, we will walk you through the steps of using it to evaluate your own set of prompts across various base models’ responses. The models in this example can be deployed through Azure or as external models deployed through MaaS (Model as a Service) endpoints.
You can learn more about how to use the promptflow-evals SDK package in our how-to documentation.
Getting started with Evaluations
First, install the necessary packages:
sh
pip install promptflow-evals
pip install promptflow-azure
Next, provide your Azure AI Project details so that traces, logs, and evaluation results are pushed into your project to be viewed on the Azure AI Studio Evaluations page:
python
azure_ai_project = {
    "subscription_id": "00000000000",
    "resource_group_name": "000resourcegroup",
    "project_name": "000000000"
}
Then, depending on which models you’d like to evaluate your prompts against, provide the endpoints you want to use. For simplicity, in our sample, an `env_var` variable is created in the code to maintain targeted model endpoints and their authentication keys. This variable is then used later in our evaluate function as our target to evaluate prompts against:
python
env_var = {
    "gpt4-0613": {
        "endpoint": "https://ai-***.***.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2023-03-15-preview",
        "key": "***",
    },
    "gpt35-turbo": {
        "endpoint": "https://ai-***.openai.azure.com/openai/deployments/gpt-35-turbo-16k/chat/completions?api-version=2023-03-15-preview",
        "key": "***",
    },
    "mistral7b": {
        "endpoint": "https://mistral-7b-**.ml.azure.com/chat/completions",
        "key": "***",
    },
    "tiny_llama": {
        "endpoint": "https://api-inference.huggingface.co/**/chat/completions",
        "key": "***",
    },
    "phi3_mini_serverless": {
        "endpoint": "https://Phi-3-mini***.ai.azure.com/v1/chat/completions",
        "key": "***",
    },
    "gpt2": {
        "endpoint": "https://api-inference.huggingface.co/**openai/gpt2",
        "key": "***",
    },
}
The following code creates a configuration for the Azure OpenAI model that acts as an LLM judge for our built-in quality evaluators, such as Relevance and Coherence. This configuration is passed as a model config to those evaluators:
python
from promptflow.core import AzureOpenAIModelConfiguration

configuration = AzureOpenAIModelConfiguration(
    azure_endpoint="https://ai-***.openai.azure.com",
    api_key="",
    api_version="",
    azure_deployment="",
)
The Prompt Flow Evaluations SDK supports a wide variety of built-in quality and safety evaluators (see the full list of supported evaluators in Built-in Evaluators) and provides the flexibility to define your own code-based or prompt-based custom evaluators.
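As an illustration of the custom-evaluator pattern, a code-based evaluator can simply be a callable that accepts the fields it needs as keyword arguments and returns a dictionary of scores. The `AnswerLengthEvaluator` below is a hypothetical example, not an SDK class; it is only a minimal sketch of that shape:
python
class AnswerLengthEvaluator:
    """Hypothetical code-based custom evaluator that reports how long an answer is."""

    def __call__(self, *, answer: str, **kwargs):
        # Return a dictionary of metric names to values, like the built-in evaluators do.
        return {"answer_word_count": len(answer.split())}

# It could then be passed to the Evaluate API alongside the built-in evaluators,
# e.g. evaluators={"answer_length": AnswerLengthEvaluator(), ...}
answer_length_evaluator = AnswerLengthEvaluator()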
For our example, we will use the built-in Content Safety evaluator (a composite evaluator for measuring harmful content in model responses) along with the Relevance, Coherence, Groundedness, Fluency, and Similarity evaluators:
python
from promptflow.evals.evaluators import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)

content_safety_evaluator = ContentSafetyEvaluator(project_scope=azure_ai_project)
relevance_evaluator = RelevanceEvaluator(model_config=configuration)
coherence_evaluator = CoherenceEvaluator(model_config=configuration)
groundedness_evaluator = GroundednessEvaluator(model_config=configuration)
fluency_evaluator = FluencyEvaluator(model_config=configuration)
similarity_evaluator = SimilarityEvaluator(model_config=configuration)
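Each initialized evaluator is itself callable, so you can sanity-check it on a single record before running a full evaluation. The question, answer, and context below are made-up examples, and the exact name of the returned score key may differ between SDK versions:
python
# Quick smoke test of one built-in evaluator on a single, hypothetical record.
sample_score = relevance_evaluator(
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
    context="France is a country in Western Europe. Its capital is Paris.",
)
print(sample_score)  # e.g. something like {"gpt_relevance": 5.0}, depending on the SDK version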
Using the Evaluate API
Now suppose we have a list of prompts that we'd like to test across different model endpoints, using the evaluators we initialized in the previous step.
The Prompt Flow Evaluations SDK provides an Evaluate API that lets you evaluate model-generated responses against the provided prompts. The Evaluate API accepts a data file containing one prompt per line; each prompt includes a question, context, and ground truth for the evaluators to use. It also accepts an Application Target class, defined in app_target.py, whose responses are evaluated for each model you're interested in testing. We will discuss this in more detail in a later section.
The following code runs the Evaluate API with the evaluators we initialized above. It provides a list of model types referenced in the Application Target class `ModelEndpoints`, defined in app_target.py. Here are the parameters required by the Evaluate API:
Data (Prompts): Questions, contexts, and ground truths are provided in a data file in JSON Lines format (data.jsonl); a sample line is shown after this list.
Application Target: The name of the Python class that can route the calls to specific model endpoints using the model’s name in conditional logic.
Model Name: An identifier of the model so that custom code in the App Target class can identify the model type and call the respective LLM model using the endpoint URL and auth key.
Evaluators: A list of evaluators used to score the prompts (questions) given as input and the outputs (answers) returned by the LLM models.
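For reference, each line of data.jsonl is a JSON object containing the fields the evaluators reference (question, context, and ground_truth). The two lines below are illustrative placeholders rather than the actual contents of the sample file:
jsonl
{"question": "Which tent is the most waterproof?", "context": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000 mm.", "ground_truth": "The Alpine Explorer Tent is the most waterproof."}
{"question": "What is the capital of France?", "context": "France is a country in Western Europe. Its capital is Paris.", "ground_truth": "Paris"}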
The following code runs the Evaluate API for each provided model type in a loop and logs the evaluation results into your Azure AI Studio project:
python
from app_target import ModelEndpoints
import pathlib
import random
from promptflow.evals.evaluate import evaluate

models = ["gpt4-0613", "gpt35-turbo", "mistral7b", "phi3_mini_serverless"]
path = str(pathlib.Path(pathlib.Path.cwd())) + "/data.jsonl"

for model in models:
    randomNum = random.randint(1111, 9999)
    results = evaluate(
        azure_ai_project=azure_ai_project,
        evaluation_name="Eval-Run-" + str(randomNum) + "-" + model.title(),
        data=path,
        target=ModelEndpoints(env_var, model),
        evaluators={
            "content_safety": content_safety_evaluator,
            "coherence": coherence_evaluator,
            "relevance": relevance_evaluator,
            "groundedness": groundedness_evaluator,
            "fluency": fluency_evaluator,
            "similarity": similarity_evaluator,
        },
        evaluator_config={
            "content_safety": {
                "question": "${data.question}",
                "answer": "${target.answer}"
            },
            "coherence": {
                "answer": "${target.answer}",
                "question": "${data.question}"
            },
            "relevance": {
                "answer": "${target.answer}",
                "context": "${data.context}",
                "question": "${data.question}"
            },
            "groundedness": {
                "answer": "${target.answer}",
                "context": "${data.context}",
                "question": "${data.question}"
            },
            "fluency": {
                "answer": "${target.answer}",
                "context": "${data.context}",
                "question": "${data.question}"
            },
            "similarity": {
                "answer": "${target.answer}",
                "context": "${data.context}",
                "question": "${data.question}"
            }
        }
    )
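Each call to `evaluate` also returns the results in memory, which can be handy for a quick local check before opening Azure AI Studio. The snippet below is a rough sketch; the exact keys in the returned dictionary (such as metrics, rows, and studio_url) may vary by SDK version, so it uses .get() defensively:
python
import json

# Aggregate metrics for the last model evaluated in the loop above.
print(json.dumps(results.get("metrics", {}), indent=2))

# Per-row inputs, outputs, and scores (the first few rows only).
for row in results.get("rows", [])[:3]:
    print(row)

# Link to the evaluation run in Azure AI Studio, if one was returned.
print(results.get("studio_url"))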
The file app_target.py is used as the Application Target, in which individual Python functions call the specified model endpoints. In this file, the `__init__` function of the `ModelEndpoints` class stores the dictionary of model endpoints and keys in the variable `env`. The model type is also provided so that the specific model can be called:
python
import requests
from typing_extensions import Self
from typing import TypedDict
from promptflow.tracing import trace


# The shape of the record returned by the target for each prompt.
class Response(TypedDict):
    question: str
    answer: str


class ModelEndpoints:
    def __init__(self: Self, env: dict, model_type: str) -> None:
        self.env = env
        self.model_type = model_type
The `__call__` function of the `ModelEndpoints` class routes the calls to a specific model endpoint by model type using conditional logic:
python
    def __call__(self: Self, question: str) -> Response:
        if self.model_type == "gpt4-0613":
            output = self.call_gpt4_endpoint(question)
        elif self.model_type == "gpt35-turbo":
            output = self.call_gpt35_turbo_endpoint(question)
        elif self.model_type == "mistral7b":
            output = self.call_mistral_endpoint(question)
        elif self.model_type == "tiny_llama":
            output = self.call_tiny_llama_endpoint(question)
        elif self.model_type == "phi3_mini_serverless":
            output = self.call_phi3_mini_serverless_endpoint(question)
        elif self.model_type == "gpt2":
            output = self.call_gpt2_endpoint(question)
        else:
            output = self.call_default_endpoint(question)
        return output
The following code handles the POST call to a model endpoint. It captures the response and parses it to retrieve the answer from the LLM. A few of the sample functions are provided below:
python
    def query(self: Self, endpoint: str, headers: dict, payload: dict) -> dict:
        response = requests.post(url=endpoint, headers=headers, json=payload)
        return response.json()

    def call_gpt4_endpoint(self: Self, question: str) -> Response:
        endpoint = self.env["gpt4-0613"]["endpoint"]
        key = self.env["gpt4-0613"]["key"]
        headers = {
            "Content-Type": "application/json",
            "api-key": key
        }
        payload = {
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 500,
        }
        output = self.query(endpoint=endpoint, headers=headers, payload=payload)
        answer = output["choices"][0]["message"]["content"]
        return {"question": question, "answer": answer}

    def call_gpt35_turbo_endpoint(self: Self, question: str) -> Response:
        endpoint = self.env["gpt35-turbo"]["endpoint"]
        key = self.env["gpt35-turbo"]["key"]
        headers = {"Content-Type": "application/json", "api-key": key}
        payload = {"messages": [{"role": "user", "content": question}], "max_tokens": 500}
        output = self.query(endpoint=endpoint, headers=headers, payload=payload)
        answer = output["choices"][0]["message"]["content"]
        return {"question": question, "answer": answer}

    def call_mistral_endpoint(self: Self, question: str) -> Response:
        endpoint = self.env["mistral7b"]["endpoint"]
        key = self.env["mistral7b"]["key"]
        headers = {
            "Content-Type": "application/json",
            "Authorization": ("Bearer " + key)
        }
        payload = {
            "messages": [{"content": question, "role": "user"}],
            "max_tokens": 50,
        }
        output = self.query(endpoint=endpoint, headers=headers, payload=payload)
        answer = output["choices"][0]["message"]["content"]
        return {"question": question, "answer": answer}
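The remaining helpers referenced in `__call__` (for the TinyLlama, Phi-3 serverless, GPT-2, and default endpoints) follow the same pattern. As a hedged sketch, a serverless (MaaS) endpoint such as the Phi-3 mini deployment typically exposes an OpenAI-compatible chat completions route with bearer-token authentication, so its helper might look like the following (the payload shape and header names are assumptions and may need adjusting for your endpoint):
python
    # Hypothetical sketch of another ModelEndpoints method; mirrors call_mistral_endpoint above.
    def call_phi3_mini_serverless_endpoint(self: Self, question: str) -> Response:
        endpoint = self.env["phi3_mini_serverless"]["endpoint"]
        key = self.env["phi3_mini_serverless"]["key"]
        # Serverless (MaaS) endpoints generally use bearer-token auth and an OpenAI-style payload.
        headers = {
            "Content-Type": "application/json",
            "Authorization": ("Bearer " + key),
        }
        payload = {"messages": [{"role": "user", "content": question}], "max_tokens": 500}
        output = self.query(endpoint=endpoint, headers=headers, payload=payload)
        answer = output["choices"][0]["message"]["content"]
        return {"question": question, "answer": answer}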
You can view the full sample notebook here.
Compare your evaluation results in Azure AI Studio
Once you run your evaluation in the SDK and log your results to your project, you can compare your results across different model evaluations in Azure AI Studio. Inside your project, use the left-hand navigation menu under the “Tools” section to get to your evaluation runs.
By default, all the model evaluation runs you’ve logged to your project from the SDK show up here. To compare the evaluations directly, click “Switch to dashboard view” above the list of evaluations:
Then select which evaluations you want to visualize in the dashboard view to compare:
In addition to comparing overall and row-level outputs and metrics, you can open each evaluation run directly to see the overall distribution of metrics in a chart view for both quality and safety evaluators, switching between them via the tabs above the charts.
Read more on how to view results in Azure AI Studio here.
How to upgrade your deployment
Once you’ve run your evaluations and decided to move to the latest model version, the upgrade process is relatively simple: you can set your deployments to auto-upgrade to the default model version.
When a new model version is set as the default in the service, your deployments will automatically upgrade to that version. You can read more about this process here.
Conclusion
Through this article, we’ve walked through not only how to upgrade your deployments to the latest generative AI model versions, but also how to use our suite of Azure AI evaluation tools to determine which model versions best meet your needs.
Once you’ve decided on the right model version for your solution, upgrading to the latest is a matter of a few simple clicks.
As always, we constantly strive to improve our services. If you have any feedback or questions, please feel free to reach out to our support team, or leave product suggestions and feedback in our Azure Feedback Forum, tagging the suggestion with Azure OpenAI.