Load Testing RAG-based Generative AI Applications
Building an Effective Strategy
Mastering Evaluation Techniques
How-To Guides
Building an Effective Strategy
Identifying What to Evaluate
The user interacts with the frontend UI to pose a question.
The frontend service forwards the user’s question to the Orchestrator.
The Orchestrator retrieves the user’s conversation history from the database.
The Orchestrator accesses the AI Search key stored in the Key Vault.
The Orchestrator retrieves relevant documents from the AI Search index.
The Orchestrator uses Azure OpenAI to generate a user response.
The connection from the App Service to the Storage Account covers the scenario in which the user wants to view the source document that grounds the provided answer.
The connection from the App Service to Speech Services covers the cases in which the user interacts with the application through audio.
Test Scenario
RPM = (u * p * s * i) / n / 60
u=10000 (total users)
p=0.1 (percentage of active users during peaktime)
s=1 (sessions per user)
i=2 (interactions per session)
n=1 (peaktime duration in hours)
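Plugging these values into the formula gives the target request rate for the test; the quick back-of-the-envelope check below (a minimal Python sketch) confirms the sizing.
# Sizing check for the scenario above: RPM = (u * p * s * i) / n / 60
u = 10_000  # total users
p = 0.1     # share of users active during peak time
s = 1       # sessions per active user
i = 2       # interactions per session
n = 1       # peak-time duration in hours
rpm = (u * p * s * i) / n / 60
print(f"Target load: {rpm:.1f} requests per minute")  # ~33.3 RPM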
Test Data
Test Measurements
Client metrics
Number of Virtual Users: The virtual user count during the load test, which helps assess application performance under different user loads.
Requests per Second: The rate at which requests are sent to the LLM app during the load test; a measure of the load your application can handle.
Response Time: The time between sending a request and receiving the full response; it does not include any time spent on client-side response processing or rendering.
Latency: The total time for an individual request, from just before the request is sent to just after the first response is received.
Number of Failed Requests: The count of requests that failed during the load test, which helps gauge the reliability of your application under stress.
Simplified example of the breakdown of request response time.
Performance Metrics for an LLM
Number of Prompt Tokens per Minute: The rate at which the client sends prompt tokens to the OpenAI model.
Number of Generated Tokens per Minute: The rate at which the OpenAI model generates response tokens.
Time to First Token (TTFT): The time interval between the start of the client's request and the arrival of the first response token.
Time Between Tokens (TBT): The time interval between consecutive response tokens being generated.
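For streamed responses, these token-level metrics roughly determine the client-side response time. Ignoring network and queuing overhead, a useful approximation is:
Response Time ≈ TTFT + (g - 1) * TBT, where g is the number of generated tokens.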
Server metrics
Azure OpenAI:
Azure OpenAI Requests: Total calls to the Azure OpenAI API.
Generated Completion Tokens: Output tokens generated by the Azure OpenAI model.
Processed Inference Tokens: The number of input and output tokens processed by the Azure OpenAI model.
Provisioned-managed Utilization V2: The percentage of the provisioned-managed deployment that is currently being used.

Azure App Service:
CPU Percentage: The percentage of CPU used by the app's backend services.
Memory Percentage: The percentage of memory used by the app's backend services.

Azure Cosmos DB:
Total Requests: The number of requests made to Cosmos DB.
Provisioned Throughput: The amount of throughput provisioned for a container or database.
Normalized RU Consumption: The normalized request unit consumption relative to the provisioned throughput.

Azure API Management:
Total Requests: The total number of requests made to APIM.
Capacity: The percentage of resource and network queue usage in the APIM instance.
When should I evaluate performance?
Enterprise LLM Lifecycle.
Mastering Evaluation Techniques
Great job on your journey so far in learning the essentials of your testing strategy! In this section we examine two distinct evaluation techniques: the first concentrates on performance testing of the entire LLM application, while the second focuses on testing the deployed LLM model itself. Keep in mind that these are just two popular techniques among many; depending on your performance requirements, integrating other techniques into your testing strategy may prove beneficial.
LLM App Load Testing
Test: A performance evaluation setup that assesses system behavior under simulated loads by configuring load parameters, test scripts, and target environments.
Test Run: A single execution of a Test.
Test Engine: The engine that runs the JMeter test scripts. Adjust the load test scale by configuring the number of test engine instances.
Threads: Parallel threads in JMeter that represent virtual users, limited to a maximum of 250 per engine instance.
Virtual Users (VUs): Simulated concurrent users, calculated as threads * engine instances.
Ramp-up Time: The time required to reach the maximum number of VUs for the load test.
Latency: The total time for an individual request, from just before the request is sent to just after the first response is received.
Response Time: The time between sending a request and receiving the full response; it does not include any time spent on client-side response processing or rendering.
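For example, a test configured with 4 engine instances running 250 threads each simulates 4 * 250 = 1,000 virtual users; with a 60-second ramp-up, roughly 17 new virtual users come online every second until that peak is reached.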
You can securely store keys and credentials used during the test as Azure Key Vault secrets, and Azure Load Testing can also use its own managed identity to access Azure resources. When deployed within your virtual network, it can generate load directed at your application’s private endpoint. Application authentication through access tokens, user credentials, or client certificates is also supported, depending on your application’s requirements.
Monitoring Application Resources
Load Testing Automation
az loadtest create \
  --name $loadTestResource \
  --resource-group $resourceGroup \
  --location $location \
  --test-file @path-to-your-jmeter-test-file.jmx \
  --configuration-file @path-to-your-load-test-config.yaml

az loadtest run \
  --name $loadTestResource \
  --resource-group $resourceGroup \
  --test-id $testId
Key Metrics to Monitor During Load Tests
Request Rate: Monitor the request rate during load testing. Ensure that the LLM application can handle the expected number of requests per second.
Response Time: Analyze response times under different loads. Identify bottlenecks and optimize slow components.
Throughput: Measure the number of successful requests per unit of time. Optimize for higher throughput.
Resource Utilization: Monitor CPU, memory, and disk usage. Ensure efficient resource utilization.
Best Practices for Executing Load Tests
Test Scenarios: Create realistic test scenarios that mimic actual user behavior.
Ramp-Up Strategy: Gradually increase the load to simulate real-world traffic patterns. The warm-up period typically lasts between 20 and 60 seconds; after the warm-up, the actual load test begins.
Think Time: Include think time between requests to simulate user interactions.
Geographical Distribution: Test from different Azure regions to assess global performance.
Performance Tuning Strategies for LLM Apps
Application Design
Optimize Application Code: Examine and refine the algorithms and backend systems of your LLM application to increase efficiency. Utilize asynchronous processing methods, such as Python’s async/await, to elevate application performance. This method allows data processing without interrupting other tasks.
Batch Processing: Batch LLM requests whenever possible to reduce overhead. Grouping multiple requests for simultaneous processing improves throughput and efficiency by allowing the model to better leverage parallel processing capabilities, thereby optimizing overall performance.
Implement Caching: Use caching for repetitive queries to reduce the application’s load and speed up response times. This is especially beneficial in LLM applications where similar questions are frequently asked. Caching answers to common questions minimizes the need to run the model repeatedly for the same inputs, saving both time and computational resources. Some examples of how you can implement this include using Redis as a semantic cache or Azure APIM policies.
Revisit your Retry Logic: LLM model deployments might start to operate at capacity, which can lead to 429 errors. A well-designed retry mechanism helps maintain application responsiveness. With the OpenAI Python SDK, you can opt for an exponential backoff algorithm, which gradually increases the wait time between retries and helps prevent service overload; a sketch follows this list. Additionally, consider falling back to another model deployment. For more information, refer to the load-balancing item in the Solution Architecture section.
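As a concrete illustration of the retry guidance above, the following minimal sketch wraps an Azure OpenAI chat completion call with exponential backoff using the tenacity library (one common way to add backoff around the OpenAI Python SDK). The endpoint, API version, deployment name, and retry limits are placeholder assumptions, not values from this article.
# Minimal sketch: exponential backoff around an Azure OpenAI call (tenacity + openai>=1.0).
import os
from openai import AzureOpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder environment variables
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

@retry(
    retry=retry_if_exception_type(RateLimitError),   # retry only on 429 responses
    wait=wait_random_exponential(min=1, max=30),     # exponential backoff with jitter
    stop=stop_after_attempt(5),                      # give up after 5 attempts
)
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
The SDK also retries some failures automatically through the client's max_retries option; the explicit decorator above simply makes the backoff behavior visible and tunable.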
Prompt Design
Generate Fewer Tokens: To reduce model latency, create concise prompts and limit token output. According to the OpenAI latency optimization guide, cutting 50% of your output tokens can reduce latency by approximately 50%. Using the ‘max_tokens’ parameter can also shorten response time.
Optimize Your Prompt: If dealing with large amounts of context data, consider prompt compression methods. Approaches like those offered by LLMLingua-2, fine-tuning the model to reduce lengthy prompts, eliminating superfluous RAG responses, and removing extraneous HTML can be efficient. Trimming your prompt by 50% might only yield a latency reduction of 1-5%, but these strategies can lead to more substantial improvements in performance.
Refine Your Prompt: Place dynamic elements, such as RAG results or conversation history, toward the end of your prompt. This keeps the static portion of the prompt at the beginning, which works well with the KV cache systems used by most large language model providers, so fewer input tokens need to be recomputed on each request (see the sketch after this list).
Use Smaller Models: Whenever possible, pick smaller models because they are faster and more cost-effective. You can improve their responses by using detailed prompts, a few examples, or by fine-tuning.
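The sketch below illustrates the two prompt-level tips above: it caps output length with max_tokens and keeps the static system instructions at the start of the request while the per-request RAG context and question come last. It reuses the AzureOpenAI client from the previous sketch; the deployment name, prompt text, and parameter values are illustrative placeholders.
# Minimal sketch: limit generated tokens and keep dynamic content at the end of the prompt.
SYSTEM_PROMPT = (
    "You are a helpful assistant for Contoso. Answer only from the provided context "
    "and cite the source document for every claim."
)  # static part first: it can be reused (and cached) across requests

def answer(question: str, rag_context: str, history: list) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},  # static
        *history,                                       # dynamic: conversation history
        {"role": "user", "content": f"Context:\n{rag_context}\n\nQuestion: {question}"},  # dynamic, last
    ]
    response = client.chat.completions.create(
        model="gpt-4o",   # your deployment name
        messages=messages,
        max_tokens=300,   # cap output length to keep latency predictable
        temperature=0.2,
    )
    return response.choices[0].message.content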
Solution Architecture
Provisioned Throughput Deployments: When using Azure OpenAI, use provisioned throughput in scenarios requiring stable latency and predictable performance, avoiding the ‘noisy neighbor’ issue of regular standard deployments.
Load Balancing LLM Endpoints: Implement load balancing across LLM deployment endpoints. Distribute the workload dynamically, for example based on endpoint latency, and establish suitable rate limits to prevent resource exhaustion and keep latency stable; a simple failover sketch follows this list.
Resource Scaling: If services show strain under increased load, consider scaling up resources. Azure allows seamless scaling of CPU, RAM, and storage to meet growing demands.
Network Latency: Position Azure resources, like the Azure OpenAI service, near your users geographically to minimize network latency during data transmission to and from the service.
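To make the load-balancing idea concrete, here is a minimal client-side sketch that fails over to a secondary Azure OpenAI deployment when the primary one throttles. In production this is usually handled by a gateway such as Azure API Management rather than in application code; the endpoint variables, deployment names, and key handling below are placeholder assumptions.
# Minimal sketch: client-side failover across two Azure OpenAI endpoints on 429 throttling.
import os
from openai import AzureOpenAI, RateLimitError

endpoints = [
    {"endpoint": os.environ["AOAI_PRIMARY_ENDPOINT"], "deployment": "gpt-4o-ptu"},       # provisioned throughput
    {"endpoint": os.environ["AOAI_SECONDARY_ENDPOINT"], "deployment": "gpt-4o-standard"}, # pay-as-you-go spillover
]

def chat_with_failover(messages):
    last_error = None
    for target in endpoints:
        client = AzureOpenAI(
            azure_endpoint=target["endpoint"],
            api_key=os.environ["AZURE_OPENAI_API_KEY"],  # in practice, a key or managed identity per resource
            api_version="2024-02-01",
        )
        try:
            response = client.chat.completions.create(model=target["deployment"], messages=messages)
            return response.choices[0].message.content
        except RateLimitError as err:  # 429: deployment is at capacity, try the next one
            last_error = err
    raise last_error  # every configured endpoint was throttled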
Azure OpenAI Benchmarking
Test Parameters
The benchmarking tool provides a number of parameters to configure the test, as well as two script entry points. The benchmark.bench entry point is the basic script, while the benchmark.contrib.batch_runner entry point can run batches of multiple workload configurations and automatically warms up the model endpoint before each test workload. Using the batch_runner entry point is recommended for accurate results and a much simpler testing process, especially when running tests for multiple workload profiles or when testing PTU model deployments.
rate: Controls the frequency of requests in requests per minute (RPM), allowing detailed management of test intensity.
clients: Specifies the number of parallel clients that send requests simultaneously, providing a way to simulate varying levels of user interaction.
context-generation-method: Selects whether to automatically generate the context data for the test (--context-generation-method generate) or to replay existing messages data (--context-generation-method replay).
shape-profile: Adjusts the request characteristics based on the number of context and generated tokens, enabling testing scenarios that reflect different usage patterns. Options are "balanced", "context", "custom", or "generation".
context-tokens (for custom shape-profile): When context-generation-method = generate and shape-profile = custom, specifies the number of context tokens in each request.
max-tokens (for custom shape-profile): Specifies the maximum number of tokens to generate in the response.
aggregation-window: Defines the duration, in seconds, of the data aggregation window. Until the test reaches the aggregation-window duration, all stats are computed over a flexible window equal to the elapsed time; this keeps RPM/TPM stats accurate even if the test ends early after hitting the request limit. A value of 60 seconds or more is recommended.
log-save-dir: If provided, the test log is automatically saved to this directory, making it simple to analyse and compare different benchmarking runs.
Warming up PTU endpoints
Retry Strategy
Output Metrics
ttft: Time to First Token. Time in seconds from the beginning of the request until the first token is received.
tbt: Time Between Tokens. Time in seconds between two consecutive generated tokens.
e2e: End-to-end response time.
context_tpr: Number of context tokens per request.
gen_tpr: Number of generated tokens per request.
util: Azure OpenAI deployment utilization percentage, as reported by the service (only for PTU deployments).
Sample Scenarios
1. Using the benchmark.bench entrypoint
python -m benchmark.bench load \
  --deployment gpt-4 \
  --rate 60 \
  --retry none \
  --log-save-dir logs/ \
  https://myaccount.openai.azure.com
2023-10-19 18:21:06 INFO using shape profile balanced: context tokens: 500, max tokens: 500
2023-10-19 18:21:06 INFO warming up prompt cache
2023-10-19 18:21:06 INFO starting load…
2023-10-19 18:21:06 rpm: 1.0 requests: 1 failures: 0 throttled: 0 ctx tpm: 501.0 gen tpm: 103.0 ttft avg: 0.736 ttft 95th: n/a tbt avg: 0.088 tbt 95th: n/a e2e avg: 1.845 e2e 95th: n/a util avg: 0.0% util 95th: n/a
2023-10-19 18:21:07 rpm: 5.0 requests: 5 failures: 0 throttled: 0 ctx tpm: 2505.0 gen tpm: 515.0 ttft avg: 0.937 ttft 95th: 1.321 tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.223 e2e 95th: 1.658 util avg: 0.8% util 95th: 1.6%
2023-10-19 18:21:08 rpm: 8.0 requests: 8 failures: 0 throttled: 0 ctx tpm: 4008.0 gen tpm: 824.0 ttft avg: 0.913 ttft 95th: 1.304 tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.241 e2e 95th: 1.663 util avg: 1.3% util 95th: 2.6%
2. Using the benchmark.contrib.batch_runner entrypoint
The following example runs a batch of two workload profiles:
context_tokens=500, max_tokens=100, rate=20
context_tokens=3500, max_tokens=300, rate=7.5
With the num-batches and batch-start-interval parameters, it also runs the same batch of tests every hour over the next 4 hours:
python -m benchmark.contrib.batch_runner https://myaccount.openai.azure.com/ \
  --deployment gpt-4-1106-ptu --context-generation-method generate \
  --token-rate-workload-list 500-100-20,3500-300-7.5 --duration 130 \
  --aggregation-window 120 --log-save-dir logs/ \
  --start-ptum-runs-at-full-utilization true --log-request-content true \
  --num-batches 5 --batch-start-interval 3600
For more detailed examples, refer to the README within the repository.
Processing and Analyzing the Log Files
After running the tests, the separate logs can be automatically processed and combined into a single output CSV. This CSV will contain all configuration parameters, aggregate performance metrics, and the timestamps, call status and content of every individual request.
With the combined CSV file, the runs can now easily be compared to each other, and with the individual request data, more detailed graphs that plot all request activity over time can be generated.
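Once the combined CSV exists, it can be explored with a few lines of pandas; the file name and column names in this sketch are assumptions for illustration rather than the tool's documented schema.
# Minimal sketch: compare benchmarking runs from the combined CSV (file and column names are assumed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("logs/combined_runs.csv")

# Compare aggregate latency metrics across runs.
print(df.groupby("run_id")[["ttft", "e2e"]].describe())

# Plot request activity over time for one run.
run = df[df["run_id"] == df["run_id"].iloc[0]].copy()
run["timestamp"] = pd.to_datetime(run["timestamp"])
run.set_index("timestamp")["e2e"].plot(title="End-to-end response time across the run")
plt.show()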
Monitoring AOAI Resource
AzureDiagnostics
| where TimeGenerated between (datetime(2024-04-26T15:30:00) .. datetime(2024-04-26T16:30:00))
| where OperationName == "ChatCompletions_Create"
| project TimeGenerated, _ResourceId, Category, OperationName, DurationMs, ResultSignature, properties_s
How-To Guides
LLM RAG application testing with Azure Load Testing.
Model deployment testing with AOAI Benchmarking Tool.
Wrapping Up
In conclusion, performance evaluation is crucial in optimizing LLM applications. By understanding your application’s specifics, creating an efficient strategy, and utilizing appropriate tools, you can tackle performance issues effectively. This boosts user experience and ensures that your application can handle real-world demands. Regular performance evaluations using methods such as load testing, benchmarking, and continuous monitoring can lead to your LLM application’s ultimate success.