How do I Evaluate my LLM Chatbot?
Earlier posts in this series:
Part 1: Is my Chatbot Ready for Production? – A 10,000 foot overview to LLMOps
After Generative AI burst onto the scene, businesses rushed to learn and leverage the technology. The first wave of adoption has most often materialized as retrieval augmented generation (RAG) chatbot products. When these initial products neared production, product owners, developers, and stakeholders soon began asking…“How do I really know if this thing is any good?”
Of course, common benchmarks for foundational models like MMLU, HellaSwag, or TruthfulQA exist, but after implementing the prompt engineering and RAG pattern needed for a specific use case, a specific testing framework is also required. The challenge is that language model outputs are inherently probabilistic [Nafar et al. 2024]. Simply put – given the same input twice, a model can produce two different outputs – and both outputs can be correct! The new wave of model evaluation and testing for generative models can be distilled into two questions:
Is my model accurate?
Is my model secure?
Is my model accurate?
Consider the following scenario:
Prompt: “Summarize the State of the Union.”
Output 1: “In the 2024 State of the Union, the President addressed key domestic and foreign policy issues. He highlighted economic growth, job creation, and infrastructure development. The President also outlined plans to strengthen healthcare, education, and national security.”
Output 2: “The President emphasized economic growth, healthcare reforms, climate action, and international cooperation in the State of the Union address. He highlighted the importance of economic innovation, education, and unity in addressing national security challenges.
The two outputs are not identical – but in this hypothetical situation they could both be considered ‘accurate’. So, how does one reliably assess accuracy? Natural language evaluation techniques can be buckets into four major categories, ranging from more probabilistic (left side in the picture below) to less probabilistic (right side in the picture below).
LLM Based
LLM Based evaluation methods have become very popular for a wide variety of natural language task assessment [Li et al. 2024]. Common LLM assessed metrics include response Coherence, Fluency, Consistency, and Relevance to a given question or context. Prompts for model assessment LLMs are highly flexible, and they can be quickly changed to improve performance with Chain-of-Thought or few-shot approaches customized to a specific use case. Research indicates that these methods are more performant than other common approaches such as BERTScore, ROGUE-L, and UniEval [Liu et al. 2023] .
Fine-Tuned Models
Fine-tuned evaluators are a natural extension of LLM based approaches. Some of these models already exist, such as Vectara’s Hallucination Evaluation model, or a team can develop their own customized fine-tuned model for a specific use cases given enough training data.
A great example of LLM / Fine-Tuned based approaches is the open source RAGAS framework. RAGAS offers a suite of metrics to users built on top of OpenAI models (by default) or any other model of a user’s choosing.
Encoding + Math
This approach seeks to transform text into a new format that can be analyzed mathematically to determine some metric. This approach can take many forms. The most common is using GPT’s Ada model(s) to embed prompt, context, and response then calculating the cosine similarity between the question-response and context-response to assess response quality. Another variation is using n-grams + BLEU score to compare question, response, and context to achieve the same goal.
Human Analysis
At the end of the day, there is no true replacement for expert human analysis.
Is my model secure?
There are many types of attacks a chatbot may face when released to users. Jailbreaking, prompt injection, data/prompt leaks, or The Waluigi Effect – just to name a few. The good news is there are many strategies to protect against these attacks. To test that your defenses are satisfactory Red Teaming is a critical part of the testing cycle. (Monitoring for attacks in real time is also a critical component – and will be discussed in detail in the next blog in this series!)
Red teaming is when a development or QA team ‘attacks’ their own application to expose and correct weaknesses. In this context, red teaming can be performed by running nefarious prompts simulating adversaries through your model and examining the results. A team can use public datasets, such as RedEval, create their own input dataset for testing specific to their user base, or a combination of both.
Put it all together
The evaluation tactics above can be leveraged to create a comprehensive evaluation framework for your LLM powered chat application. This framework can be used in tandem with normal application testing to provide confidence in a production release. The most effective testing frameworks are broken down into three parts:
Scale testing and red testing can be automated and incorporated into a CI/CD pipeline that ensure model quality and model security are above pre-defined benchmarks before allowing a release to go to production. Check out an example of CI/CD in action using Azure AI Studio. For a more custom approach, Azure PromptFlow can orchestrate many independent evaluation components to build a reliable and flexible evaluation framework. Check out this reference repository to get you started.
Finally, with an established testing framework, the development team is empowered to rapidly prototype different foundational models, prompts, and/or retrieval techniques with clear success criteria. By beginning with the end in mind, a development team can reach their goals as efficiently as possible.
The techniques offered in this blog will not only provide confidence in production deployment, but will also streamline development efforts to confidently prototype and implement state-of-the-art approaches as the field of AI continues its rapid growth!
Microsoft Tech Community – Latest Blogs –Read More