How to know if your AI agents are working as intended
First unveiled at this year's IBM Think, AgentOps is a set of tools born out of IBM Research that make it easier for AI builders, developers, and engineers to see whether the agents they've built are operating as expected.
In the span of a few short years, generative AI has gone from little more than proof-of-concept demos to reshaping countless industries. It’s now estimated that generative AI could account for up to $4.4 trillion in economic benefits in the coming years. And this revolution is being powered by agents.
Agentic systems can go off and find answers to questions, monitor sensors, or code projects on their own, without line-by-line instructions. We're seeing the first wave of this powerful new way of working now. IBM itself just announced a suite of agents at Think 2025 for areas where workers often spend their time on repetitive tasks, including HR, procurement, and sales.
But with every new technology, there are questions that developers and engineers need answered before they will feel comfortable rolling it out at scale. How do you know the agent you’ve built is actually running as you intended?
This is the problem researchers at IBM have been working to solve. Agents tend to rely on dynamic workflows and non-deterministic logic while interacting with other pieces of software, APIs, tools, and even other agents. They let users automate workflows that were previously time-consuming, but their developers, and eventual users, need to know they will work as intended every time. When a mechanic looks under the hood of a running car, they know what should be happening; it's no different with agents.
The sophistication of modern agents brings new challenges: unpredictability in how they execute tasks, variability in output quality, and even potential shifts in behavior over time. These issues mean conventional monitoring, debugging, testing, and maintenance methods are insufficient for tracking how agents operate.
This has led a team of researchers at IBM to dive into what they're calling AgentOps, giving developers and system maintainers that same under-the-hood view of their agentic systems. It enables users to understand and investigate how agents make decisions, what their memory states are, and how they use external tools. It tracks anomalies and regressions, and can carry out real-time introspection, comparing results against previous investigations. The goal isn't just to make these systems observable, but to make them iteratively improvable and adaptive to change, as well as accountable to those who govern them.
The work that IBM Research is doing will make its way into major IBM products, including IBM Instana and watsonx Orchestrate. At this year's Think conference, IBM unveiled its AgentOps solution, a powerful new set of tools designed to help organizations observe, analyze, and optimize the use of AI agents at scale. Launching as part of watsonx products, AgentOps brings observability and control to increasingly autonomous agentic systems.
The AgentOps team identified three core areas to focus on to ensure proper support for enterprise agentic AI use cases.
The first is an open-source software development kit (SDK) built on top of OpenTelemetry (OTEL) standards. It provides both automatic and manual instrumentation for agentic frameworks like LangChain, watsonx, CrewAI, and LangGraph. AgentOps extends those standards to treat agents, tasks, and tools as first-class parts of the system, making sure that important information flows smoothly between them and the environments where the tools run.
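The post doesn't spell out AgentOps' instrumentation API, but a minimal sketch using the standard OpenTelemetry Python SDK gives a feel for what manually instrumenting an agent looks like. The span names and attributes below ("agent.task", "tool.name", and so on) are illustrative assumptions, not AgentOps' actual conventions:

```python
# A minimal sketch of manual agent instrumentation with the standard
# OpenTelemetry Python SDK. Span and attribute names are illustrative,
# not AgentOps' actual semantic conventions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to the console; a real deployment would use an OTLP
# exporter pointed at its observability backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

def search_tool(query: str) -> str:
    """A stand-in for a tool the agent can call."""
    return f"results for {query!r}"

def run_agent(task: str) -> str:
    # One span per agent task, with a child span per tool invocation,
    # so a single trace captures the agent's full trajectory.
    with tracer.start_as_current_span("agent.task") as task_span:
        task_span.set_attribute("agent.name", "research-agent")
        task_span.set_attribute("agent.input", task)
        with tracer.start_as_current_span("agent.tool_call") as tool_span:
            tool_span.set_attribute("tool.name", "search")
            result = search_tool(task)
            tool_span.set_attribute("tool.output_chars", len(result))
        return result

print(run_agent("quarterly revenue by region"))
```

Because agents, tasks, and tools each get their own spans and attributes, the resulting traces can be queried like any other telemetry, rather than read back out of unstructured logs.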
The team also built an open analytics platform on top of OTEL that lets users understand in great detail how their agentic systems are working. It enables users to investigate an agent's behavior, and even make recommendations or automatic changes to optimize how agents carry out their tasks. And it's extensible, meaning AI researchers and practitioners who want to add new analytics views or new metrics can easily do so.
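The plugin interface itself isn't described in this post, but the flavor of a custom metric is easy to sketch: a few lines that roll exported span records up into a new view. The span records and their fields here are made-up placeholders, not AgentOps' actual data model:

```python
# A generic sketch of a custom analytic: aggregating exported tool-call
# spans into a mean-latency-per-tool view. The span schema is a
# hypothetical placeholder.
from collections import defaultdict
from statistics import mean

spans = [
    {"name": "agent.tool_call", "tool": "search",     "duration_ms": 120},
    {"name": "agent.tool_call", "tool": "search",     "duration_ms": 340},
    {"name": "agent.tool_call", "tool": "calculator", "duration_ms": 15},
]

def tool_latency_report(spans: list[dict]) -> dict:
    """Group tool-call spans by tool name and report mean latency per tool."""
    by_tool = defaultdict(list)
    for span in spans:
        if span["name"] == "agent.tool_call":
            by_tool[span["tool"]].append(span["duration_ms"])
    return {tool: mean(times) for tool, times in by_tool.items()}

print(tool_latency_report(spans))  # {'search': 230, 'calculator': 15}
```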
The analytics that come baked in are themselves powered by AI, meaning they can generate unique perspectives on the tasks at hand, including multi-trace workflow views and trajectory explorations. This all leads to a technology that can optimize agentic systems for accuracy, latency, and cost, as well as recommend improvements to prompts, LLM usage, and agent workflows.
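Optimizing for latency and cost starts with per-trace accounting, which a short sketch can illustrate. The trace structure and per-token prices below are hypothetical placeholders, not real rates or AgentOps' format:

```python
# An illustrative roll-up of one agent trace into the latency and cost
# figures an optimizer would need. Prices and trace layout are assumed.
PRICE_PER_1K_TOKENS = {"input": 0.0005, "output": 0.0015}  # assumed USD rates

trace = {
    "trace_id": "abc123",
    "llm_calls": [
        {"input_tokens": 800,  "output_tokens": 150, "latency_ms": 950},
        {"input_tokens": 1200, "output_tokens": 300, "latency_ms": 1400},
    ],
}

def summarize_trace(trace: dict) -> dict:
    """Total latency and estimated LLM spend for a single agent trace."""
    cost = sum(
        call["input_tokens"] / 1000 * PRICE_PER_1K_TOKENS["input"]
        + call["output_tokens"] / 1000 * PRICE_PER_1K_TOKENS["output"]
        for call in trace["llm_calls"]
    )
    latency = sum(call["latency_ms"] for call in trace["llm_calls"])
    return {"trace_id": trace["trace_id"], "latency_ms": latency,
            "cost_usd": round(cost, 4)}

print(summarize_trace(trace))
# {'trace_id': 'abc123', 'latency_ms': 2350, 'cost_usd': 0.0017}
```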
To date, IBM Research has successfully used AgentOps technology to assist in building various agents targeted at a range of IBM automation software products, including Instana, Concert, and Apptio. As agentic AI systems become more human-like in the way they can course-correct while working through a task, the tools that monitor them need to be able to do the same, in the places where developers and engineers are used to working. It should be simple to see whether agents are improving a company's performance and lowering the cost of getting things done. That's just what the team is aiming to achieve with AgentOps.