AGENTIC-SYSTEMS | June 12, 2026 | 1 min read

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Researchers introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. On held-out MCP-Bench tasks, it raises execution feasibility from roughly 3% to 17-24% across small planners.

Abstract: Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
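To make the idea of "repairing executable workflows through search" more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The catalog, the `Step` type, the three error kinds, and the repair rules are all hypothetical and invented for illustration. Real tool execution is replaced by a static validator, and adaptive intensity and meta-guided redesign are left out entirely. Diversity pruning is reduced to dropping exact duplicates.

```python
import difflib
import random
from dataclasses import dataclass

# Hypothetical toy catalog: tool name -> required parameter names.
CATALOG = {
    "search_docs": ("query",),
    "fetch_page": ("url",),
    "summarize": ("text",),
}


@dataclass(frozen=True)
class Step:
    tool: str
    params: tuple  # sorted (name, value) pairs; a value "$k" refers to step k's output


def validate(workflow):
    """Return (step_index, kind, detail) errors; an empty list means feasible."""
    errors = []
    for i, step in enumerate(workflow):
        if step.tool not in CATALOG:
            errors.append((i, "unknown_tool", step.tool))
            continue
        given = dict(step.params)
        for name in CATALOG[step.tool]:
            if name not in given:
                errors.append((i, "missing_param", name))
        for name, value in step.params:
            # A "$k" reference must point at an earlier step.
            if value.startswith("$") and not (value[1:].isdigit() and int(value[1:]) < i):
                errors.append((i, "bad_dependency", name))
    return errors


def repair(workflow, error):
    """Apply one structured edit that targets a single validation error."""
    i, kind, detail = error
    step = workflow[i]
    tool, params = step.tool, dict(step.params)
    if kind == "unknown_tool":
        # Resolve to the closest name in the catalog (assumed heuristic).
        tool = difflib.get_close_matches(detail, list(CATALOG), n=1, cutoff=0.0)[0]
    else:
        # Missing or dangling input: wire it to the previous step's output
        # when one exists, otherwise fall back to a placeholder literal.
        params[detail] = f"${i - 1}" if i > 0 else "PLACEHOLDER"
    fixed = Step(tool, tuple(sorted(params.items())))
    return workflow[:i] + (fixed,) + workflow[i + 1:]


def evolve(seed_workflow, generations=10, population_size=4, rng=None):
    """Mutate via error-targeted repairs and keep the candidates with fewest errors."""
    rng = rng or random.Random(0)
    population = [seed_workflow]
    for _ in range(generations):
        children = []
        for wf in population:
            errors = validate(wf)
            if not errors:
                return wf  # feasible workflow found
            children.append(repair(wf, rng.choice(errors)))
        # Simplified diversity pruning: drop exact duplicates, then rank by error count.
        unique = list(dict.fromkeys(population + children))
        unique.sort(key=lambda wf: len(validate(wf)))
        population = unique[:population_size]
    return population[0]  # best effort; may still contain errors


# A plausible-looking plan with a misspelled tool and a missing input.
broken = (
    Step("serch_docs", (("query", "mcp servers"),)),
    Step("summarize", ()),
)
repaired = evolve(broken)
print(validate(broken))
print(validate(repaired))
```

In this toy setting the seed plan fails validation twice (an unresolvable tool name and a missing parameter), and the search ends with a plan that passes the validator. The design point it illustrates is that the feedback signal comes from checking the workflow against the catalog rather than from imitating teacher traces.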

Source: arXiv cs.AI Recent