A New Framework for Evaluating Voice Agents (EVA)

EVA is the first end-to-end evaluation framework that jointly measures task accuracy and conversational experience for voice agents, revealing a consistent trade-off between the two.

Conversational voice agents present a distinct evaluation challenge: they must simultaneously satisfy two objectives — accuracy (completing the user's task correctly and faithfully) and conversational experience (doing so naturally, concisely, and in a way appropriate for spoken interaction). These objectives are deeply intertwined: mishearing a confirmation code renders perfect LLM reasoning meaningless, a wall of options overwhelms a caller who can't skim spoken output, and delayed responses can pass every accuracy check while remaining unusable in practice. Existing frameworks treat these as separate concerns — evaluating task success or conversational dynamics, but not both.

We introduce EVA, an end-to-end evaluation framework for conversational voice agents that evaluates complete, multi-turn spoken conversations using a realistic bot-to-bot architecture. EVA produces two high-level scores, EVA-A (Accuracy) and EVA-X (Experience), and is designed to surface failures along each dimension. EVA is the first to jointly score task success and conversational experience. We release EVA with an initial airline dataset of 50 scenarios covering flight rebooking, cancellation handling, vouchers, and more — the first in a planned series of domains.

We also provide benchmark results for 20 cascade and audio-native systems, such as speech-to-speech models and large audio language models. Our biggest finding is that there is a consistent Accuracy-Experience tradeoff; agents that perform well on task completion tend to deliver worse user experiences, and vice versa.

The field currently lacks a framework that evaluates the full quality of voice agent interactions, as most existing efforts assess individual components in isolation. For example, AudioBench, SD-Eval, VoxEval, Kimi-Eval, VoiceBench and VoxDialogue evaluate core speech understanding capabilities — transcription, paralinguistics, acoustic cues — but remain confined to single-turn, non-interactive settings. On the other hand, EmergentTTS and SHEET assess perceived speech quality using subjective listening tests (e.g., Mean Opinion Score). Beyond speech perception, FD-Bench, Talking Turns, Full-Duplex-Bench provide deeper analyses of conversational dynamics — interruptions, backchanneling, turn-taking — yet evaluate these in isolation from task-oriented tool use, leaving the relationship between dialogue quality and agentic capability unexamined. More recent efforts, notably VoiceAgentBench and CAVA, take steps towards evaluating the agentic capabilities of commercial voice agent systems, including tool-calling and complex instruction-following. However, these voice-agentic capabilities are not evaluated within complete conversational workflows that voice agents must navigate in practice: from initial user request through multi-step tool orchestration to final task resolution.

The lack of frameworks that jointly capture accuracy and experience underscores the need for a framework that treats voice agent quality as an integrated whole. This means evaluating not only whether the task succeeded, but whether the agent communicated accurately, concisely, and naturally throughout, and surfacing how these dimensions trade off against one another in realistic deployment conditions.

End-to-end evaluation reveals interaction dynamics that are not apparent at the component level: whether the agent interrupts users during natural pauses in speech, whether it recovers smoothly when a user corrects a transcription error, or whether high latency disrupts the conversational flow enough to prompt users to repeat themselves or abandon the task entirely.

EVA simulates multi-turn spoken conversations over live audio in which the agent must invoke appropriate tools, adhere to task-specific policies, and reach a deterministically verifiable end state. EVA evaluates voice agents using a bot-to-bot audio architecture composed of five core components:

User Simulator— A conversational AI configured with a specific goal and persona that plays the role of a caller. It operates in audio using high-quality TTS models, ensuring the evaluation captures representative speech-understanding challenges in natural-sounding conversational speech and realistic turn-taking dynamics.
Voice Agent— The voice agent being evaluated, built with Pipecat, an open-source Python framework for real-time voice applications. EVA supports both cascade architectures (STT → LLM → TTS) and audio-native models (S2S or S2T→ TTS).
Tool Executor— The engine that provides deterministic, reproducible tool responses via custom Python functions. It dynamically queries and modifies a predefined per-scenario database.
Validators— A set of validation metrics that check that conversations are complete and that the user faithfully reproduced the intended behavior and speech, with no human annotation required.
Metrics Suite— A suite of metrics evaluates the voice agent using the conversation recording, transcript, and tool call logs.

Each test case (scenario) in our framework is an evaluation record, structured to make tests reproducible: User Goal, User Persona, Scenario Database, and Ground Truth.

We release EVA with a synthetic airline dataset of 50 scenarios, spanning IRROPS rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers. Scenarios are designed to test temporal reasoning, policy-following, constraint satisfaction, and named-entity handling.

EVA evaluates voice agents across two fundamental dimensions, EVA-A for accuracy, and EVA-X for experience. EVA also includes a set of diagnostic metrics. Unlike the primary metrics, these are not used directly to compare or rank models — rather, they offer granular insight into why a model scores the way it does, helping identify and understand specific failure modes (e.g., ASR, speech synthesis, etc.). We report pass@k and pass^k across three trials per scenario (k = 3), capturing both peak performance and behavioral consistency.

EVA uses two evaluation methods: deterministic code-based metrics, which compute scores directly from structured data and are fast; and LLM-as-Judge metrics, which use Large Language Models (LLMs) to assess qualitative aspects of the conversation, or Large Audio Language Models (LALM) to evaluate speech directly. Each judge-based metric uses the model that performs best on a curated evaluation dataset for that specific metric.

Task completion alone is a necessary but insufficient measure of accuracy. An agent can reach the correct end state while fabricating a policy detail, misreading a confirmation code aloud, or hallucinating a flight number mid-conversation. These failures are invisible to a binary pass/fail check but directly harm users. EVA-A therefore measures three dimensions of accuracy: Task Completion, Policy Following, and Information Integrity.

Source: Hugging Face Blog