Explainer
4-minute read

What is red teaming for generative AI?

Red teaming is a way of interactively testing AI models to protect against harmful behavior, including leaks of sensitive data and generated content that’s toxic, biased, or factually inaccurate.

Red teaming predates modern generative AI by many decades. During the Cold War, the US military ran simulation exercises pitting US “blue” teams against Soviet “red” teams. Through simulated conflict, red teaming became associated with learning to think like the enemy.

The practice was later adopted by the IT industry, which used red teaming to probe computer networks, systems, and software for weaknesses that could be exploited by malicious attackers. Born from this work, red teaming now has a new domain: stress-testing generative AI for a broad range of potential harms, from safety to security to social bias.

Like traditional software, content-generating foundation models can be attacked by bad actors looking to steal data or disrupt service. But generative AI poses additional risks arising from its capacity to mimic human-created content at a massive scale. Problematic responses can include hate speech, pornography, “hallucinated” facts, copyrighted material, or private data like phone and social security numbers that were never meant to be shared.

Red teaming for generative AI involves provoking the model to say or do things it was explicitly trained not to, or to surface biases unknown to its creators. When problems are exposed through red teaming, new instruction data is created to re-align the model and strengthen its safety and security guardrails.

In the early days of ChatGPT, people traded tips on Reddit for how to “jailbreak” the chatbot, or bypass its safety filters, with carefully worded prompts. In one type of jailbreak, a bot can be made to give advice on building a bomb or committing tax fraud simply by asking it to play the role of a rule-breaking character. Other tactics include translating prompts into a rarely used language or appending AI-generated gibberish to a prompt to exploit weaknesses in the model that are imperceptible to humans.

“Generative AI is actually very difficult to test,” said IBM’s Pin-Yu Chen, who specializes in adversarial AI testing. “It’s not like a classifier, where you know the outcomes. With generative AI, the generation space is very large, and that requires a lot more interactive testing.”

Red teaming for safe, secure, and trustworthy AI

LLMs are encoded with human values and goals during the alignment phase of fine-tuning. Alignment involves feeding the model examples of the target task in the form of questions and answers known as instructions. A human or another AI then interacts with the model, asking questions and grading its responses. A reward model is trained to mimic the positive feedback, and those preferences are used to align the model.
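
To make the mechanics concrete, here is a minimal sketch of the reward-modeling step. The TinyRewardModel and the random stand-in embeddings are illustrative assumptions, not anyone's production training code: the point is simply that a small network is trained on graded response pairs so preferred answers score higher than rejected ones, and those scores are later used to steer the model during alignment.

```python
# Toy reward-model training on graded response pairs (illustrative sketch).
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a (pre-computed) response embedding to a scalar preference score."""
    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-ins for graded pairs: the response a grader preferred ("chosen")
# and the one they rejected. In practice these come from the LLM being aligned.
chosen = torch.randn(8, 32)
rejected = torch.randn(8, 32)

for _ in range(100):
    # Pairwise (Bradley-Terry style) loss: push chosen scores above rejected ones.
    loss = -torch.nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```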

AI red teaming can be thought of as an extension of alignment, with the goal of designing prompts to get past the model’s safety controls. Jailbreak prompts are still engineered by humans, but these days most are generated by “red team” LLMs that can produce a wider variety of prompts in limitless quantities.

Think of red team LLMs as toxic trolls trained to bring out the worst in other LLMs. Once vulnerabilities are surfaced, the target models can be re-aligned. With the help of red team LLMs, IBM has generated several adversarial, open-source datasets that have helped to improve its Granite family of models on watsonx, as well as Aurora, an open-source community multilingual model.
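
The loop itself is simple to outline. In the sketch below, attacker_generate, target_respond, and safety_score are hypothetical stand-ins for whatever red team LLM, target model, and harm classifier are actually in use; it is not IBM's tooling, just the general shape of automated red teaming.

```python
# Skeleton of an automated red-teaming round (helper functions are stand-ins).
from typing import Callable, List, Tuple

def red_team_round(
    attacker_generate: Callable[[int], List[str]],  # red team LLM: emits adversarial prompts
    target_respond: Callable[[str], str],           # model under test
    safety_score: Callable[[str, str], float],      # 0.0 = safe, 1.0 = clearly harmful
    n_prompts: int = 100,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (prompt, response) pairs that slipped past the target's guardrails."""
    failures = []
    for prompt in attacker_generate(n_prompts):
        response = target_respond(prompt)
        if safety_score(prompt, response) > threshold:
            failures.append((prompt, response))
    # Each failure is later paired with a safe refusal and folded back
    # into the instruction data used to re-align the target model.
    return failures
```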

“In this extended game of cat and mouse, we need to stay on our toes,” said IBM’s Eitan Farchi, an expert on natural language processing. “No sooner does a model become immune to one attack style than a new one appears. Fresh datasets are constantly needed.”

A dataset called AttaQ is meant to provoke the target LLM into offering tips on how to commit crimes and acts of deception. A related algorithm categorizes the undesirable responses to make finding and fixing the exposed vulnerabilities easier. Another, SocialStigmaQA, is aimed at drawing out a broad range of racist, sexist, and otherwise extremely offensive responses. A third red-team dataset is designed to surface harms outlined by US President Joe Biden last fall in his executive order on AI.
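
As a rough illustration of that categorization step, flagged responses can be embedded and clustered so that similar failures are reviewed and patched together. The sketch below uses generic text clustering with placeholder responses; it is not the algorithm paired with AttaQ.

```python
# Group flagged responses so similar failures can be triaged together (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

flagged_responses = [
    "response offering step-by-step deception advice ...",   # placeholder text
    "response offering tips for evading detection ...",
    "response leaking a fabricated personal record ...",
    "response leaking another fabricated record ...",
]

# Embed the responses (TF-IDF here; a sentence embedder works better in practice).
embeddings = TfidfVectorizer().fit_transform(flagged_responses)

# Cluster into coarse harm categories for review.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, text in sorted(zip(labels, flagged_responses)):
    print(label, text[:50])
```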

If red team LLMs focus solely on generating prompts likely to trigger the most toxic responses from their targets, they run the risk of resurfacing familiar problems and missing rarer, more serious ones. To encourage more imaginative trolling, IBM and MIT researchers introduced a “curiosity”-driven algorithm that adds novelty as an objective in prompt generation.
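
Conceptually, the curiosity objective rewards a candidate prompt both for eliciting an unsafe response and for being unlike prompts already tried. The sketch below captures that idea with stand-in scoring functions and embeddings; it is not the MIT-IBM implementation.

```python
# Curiosity-style objective for red-team prompt generation (conceptual sketch).
import numpy as np

def novelty_bonus(candidate: np.ndarray, past: list) -> float:
    """Higher when the candidate prompt embedding is far from everything tried so far."""
    if not past:
        return 1.0
    sims = [
        float(candidate @ p) / (np.linalg.norm(candidate) * np.linalg.norm(p))
        for p in past
    ]
    return 1.0 - max(sims)  # low maximum similarity means high novelty

def red_team_reward(toxicity: float, candidate: np.ndarray, past: list, beta: float = 0.5) -> float:
    """Combined objective: elicit unsafe output AND explore new prompt territory."""
    return toxicity + beta * novelty_bonus(candidate, past)
```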

"This lets you cast a wider net for the less obvious prompts that can trigger unsafe outputs," said Zhang-Wei Hong, an MIT student who co-authored the work with researchers at the MIT-IBM Watson AI Lab.

As red teaming has evolved, it has uncovered new threats and underscored the pervasive risks of generative AI. At IBM, Chen recently demonstrated that the safety alignment of proprietary models can be as easy to crack as that of open-source models.

In a recently published paper, Chen and collaborators at Princeton and Virginia Tech showed that OpenAI’s GPT-3.5 Turbo could be broken with just a handful of fine-tuning instructions submitted through its API. The fine-tuning process itself, Chen hypothesizes, appears to overwrite some of the model’s safeguards.

Diffusion models share some of the same vulnerabilities as LLMs. Chen has developed red teaming tools, Prompting4Debugging and Ring-A-Bell, to stress-test image-generating models. He has found that half of the prompts in so-called “safe prompting” benchmarks can be hijacked so that models output images with nudity and violence.

Beyond probing AI systems for safety and security flaws, researchers are also developing ways to protect them from attacks in the wild. In another recent paper, Chen showed that his GradientCuff detection tool could reduce the success rate of six types of LLM attacks from 75% to 25%. Check out the demo on Hugging Face.
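
In broad strokes, such a defense sits in front of the model and routes suspected jailbreak prompts to a refusal before they are ever answered. The sketch below shows only that serving-path shape, with a hypothetical is_jailbreak stand-in; it is not the GradientCuff algorithm itself.

```python
# Generic shape of a runtime jailbreak filter (detector is a hypothetical stand-in).
from typing import Callable

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],       # the underlying LLM call
    is_jailbreak: Callable[[str], bool],  # detector: True for suspected attacks
    refusal: str = "I can't help with that request.",
) -> str:
    """Route suspected jailbreak prompts to a refusal instead of the model."""
    if is_jailbreak(prompt):
        return refusal
    return generate(prompt)
```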

Overcoming the “unknown unknowns”

It’s unclear exactly where generative AI is headed, but all signs point to red teaming playing an important role.

The White House last summer co-led a red teaming hackathon at DEF CON. It was followed by President Biden's executive order on AI, which is expected to lead to legislation. The European Union last month signed the world’s first artificial intelligence law, banning some uses of AI, including social scoring systems, and requiring companies to assess and mitigate the risks associated with generative AI.

Other countries are moving to draw up their own laws. In the US, the National Institute of Standards and Technology (NIST) just launched the Artificial Intelligence Safety Institute, a consortium of 200 AI stakeholders that includes IBM.

To achieve the breadth and scale needed to stress-test enormous language models, red teaming is becoming increasingly automated. But humans will continue to play an integral role. After all, the unsafe and undesirable behaviors we want to eradicate in these models are mirror reflections of our own.

Kush Varshney, an IBM Fellow who researches AI governance, leads the innovation pipeline for watsonx.governance, a set of tools for auditing models deployed on IBM’s AI platform. Red teaming is an ongoing process, he said, and its success depends on having people of all types probing models for flaws.

“There will always be unknown unknowns, so you need humans with diverse viewpoints and lived experiences to get these models to misbehave,” Varshney said. “The models keep changing, and the world keeps changing. Red teaming is never done.”