A New Framework for Evaluation of Voice Agents (EVA)

EVA is the first end-to-end framework to jointly evaluate task accuracy and conversational experience for voice agents, revealing a consistent tradeoff between these two critical dimensions.

Conversational voice agents present a distinct evaluation challenge: they must satisfy two objectives at once. The first is accuracy, meaning the user's task is completed correctly and faithfully. The second is conversational experience, meaning the task is completed naturally, concisely, and in a way appropriate for spoken interaction. These objectives are deeply intertwined: mishearing a confirmation code renders perfect LLM reasoning meaningless, a wall of options overwhelms a caller who can't skim spoken output, and delayed responses can pass every accuracy check while remaining unusable in practice. Existing frameworks treat these as separate concerns, evaluating either task success or conversational dynamics, but not both.

We introduce EVA, an end-to-end evaluation framework for conversational voice agents that evaluates complete, multi-turn spoken conversations using a realistic bot-to-bot architecture. EVA produces two high-level scores, EVA-A (Accuracy) and EVA-X (Experience), and is designed to surface failures along each dimension. We release EVA with an initial airline dataset of 50 scenarios covering flight rebooking, cancellation handling, vouchers, and more. It is the first in a planned series of domains.

We also report benchmark results for 20 systems, spanning cascade architectures and audio-native systems (such as speech-to-speech models and large audio language models). The main finding is a consistent Accuracy-Experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa.
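One simple way to check for this kind of tradeoff in your own evaluation runs is a rank correlation between the two scores across systems: a clearly negative value means systems that rank higher on accuracy tend to rank lower on experience. The sketch below is only an illustration. The `rank` and `spearman` helpers are hypothetical and not part of the EVA codebase, and the scores are made-up placeholders, not EVA results.

```python
from typing import List


def rank(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: List[float], ys: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores for three hypothetical systems (NOT real EVA numbers).
accuracy_scores = [0.9, 0.7, 0.5]    # stand-in for an accuracy score per system
experience_scores = [0.4, 0.6, 0.8]  # stand-in for an experience score per system

rho = spearman(accuracy_scores, experience_scores)
print(rho)  # -1.0: the two rankings are exactly reversed in this toy data
```

With only a handful of systems a single correlation coefficient is a coarse summary, so it is worth plotting the two scores against each other as well.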

Code, dataset, and judge prompts are fully open-sourced at https://github.com/ServiceNow/eva.

The field currently lacks a framework that evaluates the full quality of voice agent interactions, as most existing efforts assess individual components in isolation. For example, AudioBench, SD-Eval, VoxEval, Kimi-Eval, VoiceBench, and VoxDialogue evaluate core speech understanding capabilities (transcription, paralinguistics, acoustic cues) but remain confined to single-turn, non-interactive settings. Separately, EmergentTTS and SHEET assess perceived speech quality using subjective listening tests (e.g., Mean Opinion Score). Beyond speech perception, FD-Bench, Talking Turns, and Full-Duplex-Bench provide deeper analyses of conversational dynamics (interruptions, backchanneling, turn-taking), yet evaluate these in isolation from task-oriented tool use, leaving the relationship between dialogue quality a

Source: Hugging Face Blog