Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

While LLMs offer massive context windows, scalable enterprise AI adoption requires 'agent logic'—software primitives like knowledge graphs and program analysis—to drive cost-effectiveness, accuracy, and trust.

Guides have aided humanity throughout history. Early civilizations understood that the sun and the moon could be used to navigate vast distances on land and on the high seas. Over time, repeated journeys produced maps, which enabled better planning and faster travel to familiar destinations. Centuries later, the compass gave seafarers greater accuracy in reaching unexplored destinations. Today, GPS navigation apps guide our every journey. In the world of agentic AI, agents likewise have the potential to enable scalable AI adoption and transform industries as we know them. Realizing that potential, however, requires an intelligent guide: agent logic, which drives high agent quality, cost-effectiveness, and, as a result, end-user trust.

Enterprise Workflows & Use Cases

Numerous studies have cited the high failure rate of AI pilots, while others have highlighted the need for AI to operate at the core of enterprise workflows to enable scalable adoption. [1] [2] Understanding both the failures and the proposed remedy requires a closer look at enterprise workflows. These workflows are:

A. Dynamic and long-running

B. Dependent on a plethora of APIs, databases and services

C. Often constrained by business policies and/or regulations

For an agent to function effectively under these characteristics, it needs an expanded model context, which state-of-the-art frontier LLMs certainly provide. But at what cost in hallucinations and token consumption? Further, can LLMs be equipped with an intelligent guide, a GPS of sorts, that enables agentic AI execution at the core of the workflow and drives more desirable outcomes? We tested these hypotheses by designing and building agents, equipped with pertinent agent logic, for IBM offerings that fully reflect the above characteristics. These offerings address some of the most challenging tasks confronting the subject matter experts who own various stages of the enterprise software delivery lifecycle for mission-critical workloads, including:

1. Understanding applications written in legacy code (COBOL / PL/I)
2. Expediting test generation for developers
3. Proactively responding to incidents and enabling shift-left app resiliency
4. Automating compliance modernization for critical environments

Before examining each of these domains in detail, let us define agent logic. Agent logic consists of software primitives, such as knowledge graphs, algorithms, and program analysis libraries, that operate at the agentic layer (within an agent harness) and intentionally steer the LLM in the direction of the enterprise workflow, reducing the context space. In so doing, they tend to drive more performant outcomes in a more cost-effective manner. Let us now examine how agent logic achieves such outcomes in each of the four domains above.

Understanding applications written in legacy code (COBOL / PL/I) - program analysis. [3]

IBM watsonx Code Assistant for Z (WCA4Z), used to accelerate mainframe application development and modernization with AI and automation, is equipped with an App Insights agent for application understanding, one of the primary focus areas of enterprise clients running mission-critical workloads on IBM mainframes. This agent leverages deep static analysis across the application and stores a pre-indexed representation in a database schema that spans hundreds of interrelated tables with complex semantics. The agent can therefore retrieve precise, structured information that is already available, which improves answer accuracy, reduces token usage, and minimizes back-and-forth interactions with the language model (Mistral Medium 250B in this instance). Applied to multiple mission-critical legacy systems (up to 1M lines of code and 1K programs), this approach delivers marginally superior app understanding performance with ~30× lower token consumption than a baseline frontier LLM-only approach.
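
The retrieval pattern described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-table schema, the program names, and the helper functions `build_index`, `callers_of` and `build_context` are hypothetical and are not the WCA4Z schema or API. The point is only that a question such as "who calls this program?" can be answered from a pre-computed index with one query, so the LLM receives a one-line fact instead of the source of every program.

```python
import sqlite3


def build_index() -> sqlite3.Connection:
    """Create a toy pre-indexed store. Tables and rows are invented for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE programs (name TEXT PRIMARY KEY, language TEXT, loc INTEGER);
        CREATE TABLE calls (caller TEXT, callee TEXT);
        INSERT INTO programs VALUES
            ('PAYROLL1', 'COBOL', 1200),
            ('TAXCALC', 'COBOL', 800),
            ('RPTGEN', 'PL/I', 450);
        INSERT INTO calls VALUES
            ('PAYROLL1', 'TAXCALC'),
            ('RPTGEN', 'TAXCALC'),
            ('PAYROLL1', 'RPTGEN');
        """
    )
    return conn


def callers_of(conn: sqlite3.Connection, program: str) -> list[str]:
    """Return the programs that call `program`, using the pre-computed call table."""
    rows = conn.execute(
        "SELECT caller FROM calls WHERE callee = ? ORDER BY caller", (program,)
    ).fetchall()
    return [row[0] for row in rows]


def build_context(conn: sqlite3.Connection, program: str) -> str:
    """Format the retrieved fact as a compact context line for an LLM prompt."""
    callers = callers_of(conn, program)
    listing = ", ".join(callers) if callers else "no indexed program"
    return f"{program} is called by: {listing}"


if __name__ == "__main__":
    conn = build_index()
    print(build_context(conn, "TAXCALC"))
```

In this toy setup the context handed to the model is a single short line, whereas an LLM-only approach would need the source text of all three programs to derive the same fact.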

Expediting test generation for developers with Aster - program analysis. [4], [5]

Aster is an IBM proprietary library, based on program analysis and data pre- and post-processing, used for agent-based generation of unit, integration, API and change-based tests. In analysis across multiple developer communities, it achieves higher developer ratings than various open-source tools or developer-written tests. Given these ratings, and superior line, branch and method coverage benchmarks compared with similar open-source tools (integration tests) and with zero-shot LLMs and coding agents (unit tests), all tested on open-source applications, we have been running Aster in pre-production mode on 75+ Java IBM CIO applications (up to 560+ classes and 67K+ lines of code) with the Devstral 24B model. Steady-state results to date show a 20% to 45% improvement in line, branch and method coverage. On a subset of these apps, Aster also outperforms a state-of-the-art coding agent while consuming up to 15× fewer tokens. The rationale for these results is that the program analysis output (used to prompt and “focus” the LLM), coupled with sub-agents for augmenting coverage and remediating runtime and compilation errors, enables a more performant outcome with significant cost reduction.
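
To make the idea of using program analysis output to "focus" the LLM concrete, here is a small sketch. It is not Aster: it uses Python's standard `ast` module on a Python snippet as a stand-in for analysis of Java applications, and the names `SOURCE`, `analyze` and `focused_prompt` are hypothetical. It covers only the prompting step, not the sub-agents for coverage augmentation or error remediation.

```python
import ast

# Invented code under test, used only to demonstrate the analysis step.
SOURCE = '''
def discount(price, is_member):
    if price < 0:
        raise ValueError("negative price")
    if is_member:
        return price * 0.9
    return price
'''


def analyze(source: str) -> list[dict]:
    """Return one summary per function: name, parameter names, branch-point count."""
    summaries = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # Count if/for/while statements inside the function as branch points.
            branches = sum(
                isinstance(inner, (ast.If, ast.For, ast.While))
                for inner in ast.walk(node)
            )
            summaries.append(
                {
                    "name": node.name,
                    "params": [arg.arg for arg in node.args.args],
                    "branches": branches,
                }
            )
    return summaries


def focused_prompt(summary: dict) -> str:
    """Turn the analysis summary into a short, targeted test-generation prompt."""
    params = ", ".join(summary["params"])
    return (
        f"Write unit tests for {summary['name']}({params}). "
        f"Cover all {summary['branches']} branch points."
    )


if __name__ == "__main__":
    for summary in analyze(SOURCE):
        print(focused_prompt(summary))
```

The design choice illustrated here is that the model is told exactly which signature and how many branch points to target, rather than being handed the whole code base and left to discover them.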

Proactively responding to incidents and enabling shift-left app resiliency - knowledge graphs, program analysis libraries and investigation (observability)-driven orchestration. [6], [7]

While the LLM context for the app-related use cases described in 1 and 2 is “restricted” to the app source code, runtime management of apps on deployed infrastructure brings the full underlying IT stack into play. Here we define a knowledge graph (KG) encompassing entities (microservices, database/middleware services, MELT data, that is, metrics, events, logs and traces, etc.) coupled with embedded (“tribal”) knowledge from domain experts. With such a graph, and by bounding the LLM to locally bounded reasoning for non-deterministic outcomes, an observability-driven approach reduces the context space spanning the IT stack and the underlying app source code (if relevant) for incident root cause analysis (and other use cases). With this approach, leveraging the equivalent Instana data model, we have seen the proprietary Instana “I3” (intelligent incident investigation [8]) agent achieve up to a 4.0× improvement over a ReAct agent with GPT-5.1, as measured using ITBench [9]. With Gemini 3 Flash, the performance of the ReAct agent improves to within 17% of (but still below) the I3 agent while consuming 1.6× more tokens. We have extended this approach to source code with agents for code analysis (leveraging program dependency graphs) and bug remediation (leveraging inference scaling), also tested on ITBench. The source code analysis and bug remediation agents (Gemini 2.5 Flash) outperform a state-of-the-art coding agent both in finding the culpable microservice (3.0×) and in bug repair (1.6×), while consuming 3.7× and 5.9× fewer tokens, respectively. This multi-agent system was announced at IBM Think as part of the newly unveiled IBM Concert Platform for shift-left IT Operations and is also being piloted internally with IBM CIO. [10]
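
One way to picture how a knowledge graph bounds the context for an incident investigation is a hop-limited traversal from the alerting entity. The sketch below is an assumption-laden illustration, not the I3 agent or the Instana data model: the topology in `GRAPH`, the entity names and the function `local_neighborhood` are all hypothetical. It uses a plain breadth-first search to collect only the entities within a fixed number of dependency hops.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities it depends on.
GRAPH = {
    "checkout": ["payments", "cart"],
    "payments": ["payments-db"],
    "cart": ["redis"],
    "search": ["elasticsearch"],
    "payments-db": [],
    "redis": [],
    "elasticsearch": [],
}


def local_neighborhood(graph: dict, start: str, max_hops: int) -> dict:
    """Breadth-first search: map each entity within max_hops of start to its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # Do not expand beyond the hop limit.
        for dependency in graph.get(node, []):
            if dependency not in distances:
                distances[dependency] = distances[node] + 1
                queue.append(dependency)
    return distances


if __name__ == "__main__":
    # Only these entities' telemetry would be placed in the LLM context.
    print(sorted(local_neighborhood(GRAPH, "checkout", 1)))
```

With an alert on `checkout` and a one-hop limit, the unrelated `search` and `elasticsearch` entities never enter the context, which is the kind of context reduction the paragraph above describes.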

Automating IT compliance modernization for critical environments - algorithms and adaptive planning and orchestration. [11]

Enterprises face increasingly complex and fragmented compliance requirements, forcing teams to spend considerable time manually creating controls, assessments and remediation plans. No centralized knowledge exists and fixes are written manually, which introduces a risk of errors and security gaps. Because compliance work is complex and multi-step, it requires coordinated policy-driven automation across specialized agents rather than manual effort or simple AI prompts. Our multi-agent system automates compliance by algorithmically decomposing complex tasks into coordinated steps, using adaptive planning, dynamic

Source: Hugging Face Blog