
Debugging LLMs to improve their credibility

New tools from IBM Research can help LLM users check AI-generated content for accuracy and relevance and defend against jailbreak attacks.


LLMs can take some of the drudgery out of research and writing, from summarizing meeting minutes to taking a first pass at a presentation.

But on occasion, they can also mix up facts, contradict themselves, and say things they were explicitly told not to. The nerve-wracking part is knowing when to take LLMs at their word, and when to double- and triple-check the facts because they might be hallucinating.

Their occasional missteps pose a fundamental challenge to wider AI adoption in fields like health care and education. “Hallucination is an intrinsic byproduct of how LLMs are trained — to predict the next word, not the next truth,” said Alessandra Pascale, an IBM researcher focused on improving LLM reliability.

To make LLMs more trustworthy, IBM and its collaborators are developing a range of new, more powerful open source tools to debug them. IBM’s Granite Guardian models for detecting harmful and ‘hallucinated’ content currently hold six of the top 10 spots on the GuardBench leaderboard.

IBM Research is also expanding on its longstanding work in AI explainability. Its new In-Context Explainability 360 toolkit allows developers to probe the generation process to understand an LLM’s ‘thought’ process and uncover potential problems. ICX360 is the first of several LLM-focused toolkits researchers plan to roll out soon.

IBM is also developing a framework, called FactReasoner, to explicitly fact-check LLMs and LLM agents. It checks long-form answers for accuracy by referencing external sources and weighing conflicting pieces of evidence.

FactReasoner calculates an overall accuracy score after arranging each fact and claim on a graph, highlighting for users the pieces of information likely to be suspect. The tool is part of a larger project with NASA to develop a fact-checker that can transparently break down AI-generated scientific content to identify and correct mistakes for scientists and the general public.

Probing for explanations

Prompt an LLM with a question or request, and you’ll get back one kind of answer. But change a few words, and you might get a different, even contradictory, answer. IBM researchers devised the MExGen algorithm to identify the words, phrases, or sentences in a prompt that most influence the model’s corresponding answer.

“It gives you some insight into what the model was referencing to generate its response, which makes your fact-checking job easier,” said Dennis Wei, an IBM researcher who helped develop MExGen.

Part of the ICX360 toolkit, MExGen was featured this week in an oral presentation at the ACL 2025 conference in Vienna. It works by splitting a prompt or a retrieved document into segments and generating shorter prompts that each contain a subset of the information. A separate model compares the altered prompts and their responses to the LLM's original response and assigns a similarity score. “If the response changes a lot, you know that snippet was important,” said Wei.

Collectively, these insights are translated into a semantic heat map that shows the words or phrases that most influenced the LLM’s initial answer. The most significant snippets are marked in dark blue, helping developers to pinpoint trigger words and phrases that can cause a model to go off the rails.

blogArt-llmDebug-diagramMexGEN.jpg
IBM's MExGen algorithm highlights the words, phrases, and sentences in a prompt or retrieved document that most influence the model's answer. Here, the most important words in the source document are highlighted in dark blue.
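
For readers who want a concrete picture of this perturb-and-compare idea, here is a minimal Python sketch. It is not the ICX360 implementation; `generate` (a call to the LLM) and `similarity` (a semantic scorer, such as cosine similarity over sentence embeddings) are hypothetical stand-ins the caller would supply.

```python
# Minimal sketch of perturbation-based attribution in the spirit of MExGen.
# `generate` and `similarity` are hypothetical stand-ins supplied by the
# caller, not the ICX360 API.

def attribute_segments(prompt_segments, generate, similarity):
    """Score each segment by how much dropping it changes the model's response."""
    original = generate(" ".join(prompt_segments))
    scores = []
    for i in range(len(prompt_segments)):
        # Build a perturbed prompt with segment i removed.
        perturbed = " ".join(s for j, s in enumerate(prompt_segments) if j != i)
        response = generate(perturbed)
        # A big drop in similarity means segment i mattered a lot.
        scores.append(1.0 - similarity(original, response))
    return scores


# Example usage (hypothetical model and scorer):
# segments = ["The meeting is on Tuesday.", "The budget was approved.", "Send notes to Ana."]
# importance = attribute_segments(segments, generate=my_llm, similarity=my_scorer)
```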

MExGen was partly inspired by SHAP, a method for interpreting AI predictions that is based on Shapley values from game theory, a way of fairly distributing gains (or costs) among a group of players or collaborators. Applied to LLMs, SHAP measures how much each part of a query contributes to the model’s answer.

Previous SHAP-style algorithms were limited to open LLMs that expose their output ‘logits,’ numerical scores indicating how likely the model is to produce each token next. But IBM’s new method can probe models whose logits are hidden behind an API. It can also do this more efficiently, letting users decide whether they want a word-by-word or sentence-by-sentence explanation.
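
To illustrate the underlying game-theoretic idea, the toy function below computes exact Shapley values for a handful of prompt segments, treating the similarity of a perturbed response to the original response as the ‘payout.’ It is meant only to show how Shapley attribution works; it is not the SHAP library or IBM’s method, and exhaustive enumeration like this is only feasible for a few segments, which is why practical tools approximate it.

```python
# Toy example of exact Shapley attribution over a few prompt segments.
# `value` is a caller-supplied function that maps a subset of segments to
# how close the resulting response is to the original response.
from itertools import combinations
from math import factorial

def shapley_scores(segments, value):
    n = len(segments)
    scores = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                # Standard Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = value([segments[j] for j in sorted(subset + (i,))])
                without_i = value([segments[j] for j in subset])
                scores[i] += weight * (with_i - without_i)
    return scores
```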

Contrastive explanations offer another window into LLM generation. IBM’s contrastive explanation tool, called CELL, builds on earlier work and is also part of the ICX360 toolkit. CELL iteratively rephrases the user’s initial question until one prompts the model to output an answer contradicting its original response.

Like MExGen, CELL uses a scoring function that rates both the adversarial prompts and the responses they generate by their semantic divergence from the model’s original answer. Once the model outputs a sufficiently contradictory or less preferable response, the cycle stops.
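
Conceptually, the loop looks something like the rough outline below. It is not the CELL code; `paraphrase`, `generate`, and `divergence` are hypothetical helpers (a rephrasing model, the target LLM, and a semantic distance score) standing in for the real components.

```python
# Rough outline of a contrastive-explanation loop. `paraphrase`, `generate`,
# and `divergence` are hypothetical helpers, not the CELL API.

def find_contrastive_prompt(prompt, paraphrase, generate, divergence,
                            threshold=0.8, max_iters=50):
    original_answer = generate(prompt)
    candidate = prompt
    for _ in range(max_iters):
        candidate = paraphrase(candidate)      # rephrase the question again
        answer = generate(candidate)           # ask the model the new version
        if divergence(original_answer, answer) >= threshold:
            return candidate, answer           # sufficiently contradictory answer
    return None, None                          # gave up within the budget
```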

Both CELL and MExGen can reveal an LLM’s hidden trigger words that can elicit inaccurate or contradictory responses. In their experiments, the researchers found that replacing the word “worst” with “important” in the question “What’s the worst part of your job?” changed the model’s response from the snarky “I don’t have a job, I’m a computer program” to something you might read in an HR manual: “Building strong relationships with colleagues and clients is the most important part of many jobs.”

blogArt-llmDebug-cellBlock.jpg
IBM's CELL algorithm iteratively rephrases the user’s initial question until one prompts the model to output an answer contradicting its original response. Both CELL and MExGen can reveal an LLM’s hidden trigger words that can elicit inaccurate or contradictory responses.

A chatbot that flip-flops can be annoying. But when it generates misinformation, it could undermine societal trust — and even hurt the bottom line.

When the researchers asked a chatbot trained on IBM’s own business conduct guidelines whether consulting for competitors was forbidden, it correctly responded “no.” But when the question was rephrased to replace “competitors” with “other companies,” the model reversed itself, in violation of IBM guidelines.

“The prompt has the same meaning as the original prompt, but it triggers the opposite answer,” said Ronny Luss, an IBM researcher who helped develop the MExGen and CELL algorithms. “Developers can use this information to retrain the model. It’s a good example of using contrastive explanations for debugging.”

Defusing threats

A third tool in the ICX360 toolkit is a token highlighter that can help developers identify and thwart outside attempts to ‘jailbreak’ their model and override its safety guardrails. Successful jailbreaks often start with an attacker coaxing the model to give an affirmative response, such as, “Sure, I can tell you how to build a bomb.”

The tool works by flagging tokens most likely to trigger an affirmative response and neutralizing them by shrinking their embeddings. In experiments with two aligned LLMs, researchers found that the token highlighter could defend against a variety of jailbreak attacks without hindering the models’ performance on the AlpacaEval benchmark.
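
The sketch below illustrates the general idea in PyTorch-style code: score each prompt token by how strongly its embedding influences an ‘affirmation’ objective, then shrink the embeddings of the highest-scoring tokens. The `affirmation_loss` callable, the gradient-norm scoring, and the shrink factor are simplifying assumptions for illustration, not the toolkit’s actual implementation.

```python
# Conceptual sketch of highlighting and soft-removing suspicious tokens.
# `affirmation_loss` is a hypothetical callable provided by the caller.
import torch

def soft_remove_tokens(token_embeddings, affirmation_loss, top_k=5, scale=0.5):
    """token_embeddings: (seq_len, dim) tensor of prompt-token embeddings.
    affirmation_loss: callable mapping embeddings to a scalar that measures how
    inclined the model is to begin its reply affirmatively (e.g. "Sure, ...")."""
    embeds = token_embeddings.clone().detach().requires_grad_(True)
    loss = affirmation_loss(embeds)
    loss.backward()
    # Tokens whose embeddings most strongly influence the affirmation loss
    # get the largest gradient norms.
    influence = embeds.grad.norm(dim=-1)
    suspects = torch.topk(influence, k=min(top_k, embeds.shape[0])).indices
    defended = embeds.detach().clone()
    defended[suspects] *= scale  # shrink the suspicious tokens' embeddings
    return defended, suspects
```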

“It’s a cost-effective interpretable defense because only one query to the protected LLM is needed to locate critical tokens,” said IBM researcher Pin-Yu Chen, an expert on generative AI red teaming who co-developed the method.

Validating scientific content

One of the main difficulties in flagging inaccuracies in LLM-generated content is that bad information is often intertwined with the good, with everything delivered in a fluid, authoritative tone. Different fact-checking strategies have emerged to winnow out the truth.

The most popular approaches isolate facts and claims within the model’s answer to a user’s question and check them for accuracy against an external knowledge source like Wikipedia or Google. A factuality score is then calculated for the entire response.
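
In code, that basic recipe looks roughly like the sketch below. The helpers `extract_claims`, `retrieve_evidence`, and `supports` are placeholders for a claim splitter, a search over the knowledge source, and an entailment check; they are not drawn from any specific IBM tool.

```python
# Bare-bones decompose-and-verify pipeline with placeholder helpers.

def factuality_score(response, extract_claims, retrieve_evidence, supports):
    claims = extract_claims(response)
    if not claims:
        return 1.0                                   # nothing to check
    verified = 0
    for claim in claims:
        evidence = retrieve_evidence(claim)          # e.g. Wikipedia passages
        if any(supports(passage, claim) for passage in evidence):
            verified += 1
    return verified / len(claims)                    # share of supported claims
```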

One weakness in this approach is that classifying statements as true or false is often trickier than it seems. Facts baked into an LLM through training can conflict with facts retrieved from independent sources, and those sources may themselves conflict with each other.

FactReasoner tries to inject logic into the review process. “Our intent was to design a workflow that mimics the way that humans weigh evidence and context,” said Radu Marinescu, an IBM researcher who co-developed FactReasoner.

blogArt-llmDebug-factReasoner-blue.png
FactReasoner checks each claim within a long LLM response against retrieved external sources to calculate the odds that the overall response is accurate. The claim best supported by the evidence is highlighted here in blue.

FactReasoner breaks down a long LLM-generated response into its constituent claims, and each claim is checked against two or more external sources. The external facts are assigned trustworthiness scores and arranged as nodes on a graph to gauge how likely the LLM’s overall response is to be correct.
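
The toy example below shows one simple way to aggregate trust-scored evidence over such a graph: each claim’s probability is the trust-weighted share of supporting versus contradicting sources, and the response score is the average over claims. The data structures and the averaging rule are illustrative assumptions, not FactReasoner’s actual reasoning procedure.

```python
# Toy aggregation over a claim graph with trust-scored evidence.

def claim_probability(evidence):
    """evidence: list of (trust, relation) pairs, relation in {"supports", "contradicts"}."""
    support = sum(trust for trust, rel in evidence if rel == "supports")
    against = sum(trust for trust, rel in evidence if rel == "contradicts")
    if support + against == 0:
        return 0.5                              # no usable evidence either way
    return support / (support + against)

def response_factuality(claim_graph):
    """claim_graph: dict mapping each claim to its list of (trust, relation) pairs."""
    probabilities = [claim_probability(ev) for ev in claim_graph.values()]
    return sum(probabilities) / len(probabilities)

# Example with made-up evidence:
# graph = {
#     "Claim A": [(0.9, "supports"), (0.3, "contradicts")],
#     "Claim B": [(0.8, "contradicts")],
# }
# print(response_factuality(graph))
```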

Comparing FactReasoner’s performance on several benchmark datasets for long-form factuality, the researchers found that their approach did significantly better than several leading AI fact-checking algorithms.

What’s next

Together, these techniques can minimize the chances that an LLM will undermine its own credibility. IBM Research plans to open source two complementary toolkits soon: AI Steerability 360 is designed to give users greater control over their LLM’s behavior, while a set of contextual privacy tools is designed to sanitize information that users may reveal during agentic workflows. Researchers are also working to add a self-reflection loop to FactReasoner that would allow facts and claims flagged as unreliable to be corrected.
