Responsible AI Mitigation Layers
Generative AI is increasingly being used in a wide range of systems to augment humans and infuse intelligent behavior into existing and new apps. While this opens up a world of opportunities for new functionality, it has also created a new set of risks due to its probabilistic nature and its natural language prompt interface.
When building generative AI applications, it's important to address potential risks to ensure safe and responsible AI deployment. The first category of risk is the overall quality of the application: we need to prevent errors where the system produces incorrect or fabricated information. Another key area is robustness against adversarial attacks, such as jailbreak attempts, where users manipulate the system to bypass restrictions, or newer threats like prompt injection attacks, where hidden instructions are embedded within data sources. Additionally, traditional risks must be mitigated, including harmful content such as inappropriate language, imagery, or security-vulnerable code. Because we interact with these models in natural language, much as we do with other people, they can be susceptible to attacks similar to social engineering. This raises questions about when it is appropriate for systems to behave like humans and when doing so might lead to misleading or harmful experiences. Despite the range of risks, we have found that a common framework and layered defense mechanisms can address these diverse challenges effectively.
In this blog post, we will talk about the mitigation strategies to use against attacks on generative AI systems. But before we do that, let's take a quick look at the underlying Responsible AI principles that guide these mitigation mechanisms.
AI is a transformative horizontal technology like the Internet that will change the way we interact and work with technology. We must use it responsibly and ethically. At Microsoft, our commitment begins with our AI principles, which guide our approach to responsible AI development. Our principle of Fairness ensures that AI systems allocate opportunities, resources, and information in ways that are equitable for all users. Reliability and Safety and Privacy and Security focus on building systems that perform well across various contexts, including those they weren't originally designed for, while safeguarding user data. The principle of Inclusiveness emphasizes designing AI to be accessible to people of all abilities. Finally, Transparency and Accountability ensure that AI systems are understandable, minimize misuse, and allow for human oversight and control. These six principles remain a steadfast foundation, adapting to new challenges as AI technologies evolve, and provide a framework to guide our implementations.
If we look at the Generative AI lifecycle, it can be broken down into five stages that form an iterative process:
Governance – You begin by aligning roles and responsibilities as well as establishing requirements.
Map – Once a use case is defined, you can red team it using tools such as PyRIT to figure out what is possible.
Measure – Once you have figured out what is possible, measure the extent of the risk at scale.
Mitigate – Next, reduce or eliminate those risks using different system checks.
Operate – Finally, once the mitigations are in place, operationalize the system to monitor risks and respond to incidents. This happens continually in a loop as you identify new risks in production and update the system to respond and mitigate.
Let’s now talk about the mitigation layers in more detail. We find that most production applications require a mitigation plan with four layers of technical mitigations:
The model
Safety system
System message and grounding
User experience
The model and safety system layers are typically platform layers, where built-in mitigations are common across many applications and are built into Azure for you. The next two layers largely depend on the application's purpose and design, so the implementation of mitigations can vary significantly from one application to the next. Also, the foundation model you're using is an important component of the system, but it's not the complete system.
Model
Choosing the right foundation model is a crucial step in developing an effective AI application. But how do you determine which model best suits your needs? The Azure AI model catalog offers over 1,600 models along with tools to help you make the best choice. For instance, it provides powerful benchmarking and evaluation capabilities, allowing you to compare the performance of multiple models side by side to find the one that performs best for your use case. Additionally, you can access model cards that offer detailed information from the model provider. These resources help you assess whether a model is a good fit for your application, while also highlighting potential risks that need to be mitigated and monitored throughout development.
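As a lightweight illustration of side-by-side comparison, the sketch below sends the same evaluation prompts to two candidate Azure OpenAI deployments and prints their answers. The deployment names, prompts, and API version are placeholder assumptions; in practice, the model catalog's built-in benchmarking and evaluation tools give you a far more systematic comparison.

```python
# Minimal sketch: send the same prompts to two candidate deployments and compare outputs.
# Deployment names, prompts, and API version are hypothetical placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

candidate_deployments = ["gpt-4o-mini", "gpt-4o"]  # hypothetical deployment names
eval_prompts = [
    "Summarize our refund policy in two sentences.",
    "List three risks of exposing internal order data to end users.",
]

for deployment in candidate_deployments:
    for prompt in eval_prompts:
        response = client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        print(f"[{deployment}] {prompt}\n{response.choices[0].message.content}\n")
```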
Safety System
For most applications, relying solely on the safety measures built into a model is insufficient. Even with fine-tuning, large language models (LLMs) can still make errors and are vulnerable to attacks like jailbreaks. That's why, at Microsoft, we employ a layered, defense-in-depth approach to AI safety. We use an AI-based safety system that surrounds the model, monitoring inputs and outputs to help prevent attacks and identify mistakes. In some cases, this can be a specialized model trained for safety. This system, known as Azure AI Content Safety, features configurable filters that monitor for violence, hate, sexual content, and self-harm. The filters can be customized for specific use cases; for example, a gaming company may allow more permissive language in inputs but restrict violent language in outputs. This safety system is integrated directly into the Microsoft Copilot ecosystem, ensuring built-in protection. Moreover, we offer the same technology to developers via Azure AI, enabling them to create safer AI applications from the outset.
Azure AI Content Safety incorporates three types of filters:
Content filters – For harmful content, like text and imagery containing violence or hate speech, which you can adjust by severity level. These are set to a medium threshold by default. For Azure OpenAI models, only customers who have been approved for modified content filtering have full content filtering control, including configuring content filters at severity level high only or turning off content filters. Apply for modified content filters via this form.
Security – Prompt Shields are detection models that can be turned on for model inputs to detect when a user is trying to attack or manipulate the AI system into doing something outside its intended purpose or design.
Quality – Detection models that can be turned on to flag other kinds of risky inputs or outputs, such as protected or copyrighted material or code, or factually incorrect information where the model output does not align with the source material provided.
In addition, customers can also create custom blocklists to filter specific terms in inputs or outputs.
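For applications that call the safety system directly, the same checks are available programmatically. Below is a minimal sketch using the azure-ai-contentsafety Python package to screen a user prompt before it reaches the model; the endpoint and key variables are placeholders, the blocking threshold is illustrative, and attribute names may vary slightly between SDK versions.

```python
# Minimal sketch: screening a piece of text with Azure AI Content Safety.
# Endpoint/key values are placeholders; the severity threshold is illustrative.
import os
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

user_input = "Example user prompt to screen before sending it to the model."
result = client.analyze_text(AnalyzeTextOptions(text=user_input))

BLOCK_THRESHOLD = 4  # illustrative cutoff on the returned severity score
for category in result.categories_analysis:
    if category.severity is not None and category.severity >= BLOCK_THRESHOLD:
        print(f"Blocked: {category.category} severity {category.severity}")
        break
else:
    print("Input passed content safety checks.")
```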
System Message and Grounding
The system message and grounding layer is where back-end prompt engineering comes into play, allowing applications to harness the power of large language models (LLMs) effectively. Although these models are trained on vast amounts of information, they have limitations, and their embedded knowledge stops at training time. They also lack access to the private or proprietary data that can differentiate your application. Combining the general reasoning and language-generation capabilities of LLMs with specific business information addresses these limitations. Retrieval-augmented generation (RAG) is a popular approach for implementing this: it lets the model search and use your data to provide accurate answers. For example, in Bing Chat, when a search is performed, the model retrieves and uses the search results to generate more precise responses. This keeps the data behind responses fresh and accurate and reduces reliance on the model's pre-existing knowledge. Additionally, how you guide the model to use this data is vital. The system message, or metaprompt, significantly influences the tone, style, and scope of the model's responses, shaping how it interacts within your application.
Even small changes to a system message can have an outsized impact. In a system message, it is far more effective to tell a model what not to do and what to do instead than to simply tell it what not to do. These learnings from our own experience of building multiple copilots are integrated into the Azure AI Studio playground, so you don't need to start from scratch when building a system message.
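To make the pattern concrete, here is a minimal RAG-style sketch: retrieved snippets are injected into the prompt, and the system message tells the model what to do instead of guessing. The retrieve_documents function, deployment name, and system message wording are illustrative placeholders rather than a prescribed implementation.

```python
# Minimal RAG-style sketch: ground the model in retrieved snippets via the system message.
# retrieve_documents stands in for your search layer (vector or keyword index);
# the deployment name and system message wording are illustrative.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

SYSTEM_MESSAGE = (
    "You are a support assistant for Contoso. Answer only using the provided sources. "
    "If the sources do not contain the answer, say you do not know instead of guessing. "
    "Cite the source id for every claim, and keep answers under 150 words."
)

def retrieve_documents(question: str) -> list[dict]:
    # Placeholder for your retrieval step (Azure AI Search, vector store, etc.).
    return [{"id": "doc-42", "content": "Refunds are processed within 5 business days."}]

def answer(question: str) -> str:
    sources = "\n".join(f"[{d['id']}] {d['content']}" for d in retrieve_documents(question))
    response = client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer("How long do refunds take?"))
```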
User Experience
User experience is the layer where much design innovation is happening as we build new ways of interacting with AI systems. The guidelines below will help you build better user interfaces and experiences.
Be transparent about the AI's role and limitations – This helps users stay vigilant about detecting potential mistakes. Also disclose the role AI plays in the process, as this sets user expectations for using the system.
Ensure humans stay in the loop – AI systems typically augment humans, so incorporating mechanisms for human feedback is important. This is especially true for high-risk use cases in finance, healthcare, and insurance, among others.
Mitigate misuse and overreliance – Provide citations to source data so users can verify responses, and prepare pre-determined responses where appropriate.
Documentation – Provide user guidelines and best practices for users of the system.
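As a simple illustration of the first three guidelines, the hypothetical sketch below shows a response payload that carries an explicit AI disclosure, source citations users can verify, and a hook for human feedback. None of the names or fields here are a prescribed API; they only make the guidelines tangible.

```python
# Hypothetical response payload illustrating the UX guidelines above:
# an explicit AI disclosure, citations users can verify, and a feedback hook.
from dataclasses import dataclass, field

@dataclass
class Citation:
    source_id: str
    title: str
    url: str

@dataclass
class AssistantResponse:
    answer: str
    citations: list[Citation] = field(default_factory=list)
    disclosure: str = "AI-generated content. Review before acting on it."

    def render(self) -> str:
        refs = "\n".join(f"  [{c.source_id}] {c.title} ({c.url})" for c in self.citations)
        return f"{self.disclosure}\n\n{self.answer}\n\nSources:\n{refs}"

def record_feedback(response: AssistantResponse, helpful: bool) -> None:
    # Hook for human-in-the-loop review; wire this to your telemetry or incident pipeline.
    print(f"feedback recorded: helpful={helpful}")

resp = AssistantResponse(
    answer="Refunds are processed within 5 business days. [doc-42]",
    citations=[Citation("doc-42", "Refund policy", "https://contoso.example/refunds")],
)
print(resp.render())
record_feedback(resp, helpful=True)
```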
I hope the mechanisms and guidelines above have given you a good overview of how to build better AI applications. Take a look at the following resources to learn more:
Microsoft Trustworthy AI – https://blogs.microsoft.com/blog/2024/09/24/microsoft-trustworthy-ai-unlocking-human-potential-starts-with-trust/
Microsoft Responsible AI guidelines – https://www.microsoft.com/en-us/ai/principles-and-approach
The Python Risk Identification Tool for generative AI (PyRIT) – https://github.com/Azure/PyRIT
Operationalize AI responsibly with Azure AI Studio – https://www.youtube.com/watch?v=FHeVBfqelts/
RAI Playlist – aka.ms/RAI-playlist
Operationalize AI Responsibly with Azure AI Studio Learning Path – aka.ms/RAI-learn-path