Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

VAKRA is a new executable benchmark designed to evaluate AI agents' compositional reasoning and tool-use capabilities in enterprise environments. Featuring over 8,000 APIs across 62 domains, it highlights the current limitations of LLMs in complex, multi-step workflows.
VAKRA Dataset | LeaderBoard | Release Blog | GitHub | Submit to Leaderboard
We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.
Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.
VAKRA provides an executable environment where agents interact with over 8,000+ locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3-7 step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.
As can be seen below, models perform poorly on VAKRA - in this blog, we include additional dataset details about the tasks in VAKRA and present an analysis of failure modes we observed on different tasks.
As shown below, the VAKRA benchmark comprises of four tasks, each testing a different set of capabilities.
Fig 1: Representative examples of each capability in the VAKRA benchmark
This capability includes 2,077 test instances across 54 domains, requiring the use of tools from the SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). Compared to the setup in Elder et al., the tool universe in SLOT-BIRD and SEL-BIRD is expanded through the inclusion of a larger number of domains. Each domain is restricted to one tool collection, and tasks involve chaining 1–12 tool calls to arrive at the final answer.
{
"query": "Which football team has a build-up play speed of 31, build-up plan dribbling of 53, and build-up play passing of 32?",
"tool_calls":[
{
"name": "get_data",
"arguments":{"tool_universe_id":"486ea46224d1-aeb8037c5e78"},
"label": "retrieved_data_1"
},
{
"name": "select_data_equal_to",
"arguments":{"data_label":"retrieved_data_1","key_name":"play_speed","value":31},
"label": "FILTERED_DF_0"
},
{
"name": "select_data_equal_to",
"arguments":{"data_label":"FILTERED_DF_0","key_name":"play_dribble","value":53},
"label": "FILTERED_DF_1"
},
{
"name": "select_data_equal_to",
"arguments":{"data_label":"FILTERED_DF_1","key_name":"play_passing","value":32},
"label": "FILTERED_DF_2"
},
{"name":{get_team_name},"arguments":{"data_label":"FILTERED_DF_2","n":1}}],
"answer": "FC Barcelona"
}
Fig 2: Data sample from SEL-BIRD collection
As shown above, each instance has an associated JSON data source from which the answer must be derived. The MCP servers supporting this task include a special tool, called get_data(tool_universe_id=id), which must be called at the beginning of each instance. This tool initializes the data source, returns a lightweight preview of the data, and stores the full dataset server-side to avoid large data transfers. This prevents the inefficient transfer of large data over the MCP protocol. The call also configures the MCP server to expose the appropriate tool set based on the tool_universe_id and aligns the data source with the domain-specific database for the instance.
The SLOT-BIRD collection provides a global set of 7 tools for generic data manipulation (e.g., filtering, sorting), inspired by systems like Tableau and Google Analytics. The SEL-BIRD collection extends this by introducing more specialized tools: some are shared with SLOT-BIRD, while others are derived by flattening categorical arguments into separate functions. Additionally, the generic (retrieve_data) function from SLOT-BIRD is replaced with query-specific getters. Every key in the data for a given instance has an associated get function (get_KEY_NAME) for an average of 4 get functions per instance.
{
"handle": "retrieved_data_1",
"num_records": 2,
"key_details": [
{"name": "team_name", "dtype": "str", "first_3_values": ["FC Barcelona", "Manchester City"]},
{"name": "play_speed", "dtype": "int32", "first_3_values": [31, 40]},
{"name": "play_dribble", "dtype": "int32", "first_3_values": [53, 30]},
{"name": "play_passing", "dtype": "int32", "first_3_values": [32, 16]}
]}
Fig 3: Data preview obtained from get_data function
This capability includes 1,597 instances across 17 domains, requiring tools from an expanded REST-BIRD collection (Elder et al.). These use endpoint-style interfaces that provide highly specific, query-aligned endpoints that encapsulate most computation. They are served as REST APIs running in a FastAPI server, which is wrapped by the MCP server. This task requires selecting the correct APIs from the domain-specific tool set. Each domain contains a minimum of 6 to a maximum of 328 tools (with an average of 116 tools). Similar to the previous task, the get_data tool configures the MCP server to expose only the relevant domain-specific APIs.
The OpenAI API Specification restricts the tool list input to a maximum length of 128 tools. This restriction requires an agent builder using this API to manage the length of the tool list directly via a shortlisting mechanism. In the baseline agents in our repository, a simple shortlisting capability handles this challenge.
The Capability 3 segment of the benchmark has 869 test instances drawn from 38 subject domains. These instances rely again on the REST-BIRD API collection, but add multi-hop reasoning to the challenge. Multi-hop questions require multiple pieces of supporting evidence to be extracted and combined to reach an answer. The instances in this section require between one and five logical hops to answer a query.
Capability 4 includes 644 instances across 41 domains and is also built on the REST-BIRD API collection. It contains the most complex queries with the following characteristics:
Multi-Source: This segment adds document indices per domain. Queries in this capability could require information from these document indexes as well as API calls. Similar to Capability 3, this task also has Multi-Hop queries. The required information source applies at the per-hop level, so, for example, a question may entail three logical hops with sources: API - RAG (Document Retrieval) - API. To enforce correct reasoning, sources are decontaminated during data generation.
Multi-Turn: This segment of the dataset also adds multi-turn conversations to the setting. Each instance is a dialog with multiple turns. The data is released as context-response pairs, where the context encodes the current dialog history and the agent is only responsible for answering the current turn.
Tool-usage Policies: A subset of these instances includes tool-use policies that the agent is required to follow. These policies take the form of plain-text instructions about the knowledge sources that the agent is allowed to access and under which circumstances.
Source: Hugging Face Blog













