arxiv:2605.13841

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Published on May 13 · Submitted by Orlando Marquez on May 14

Abstract

EVA-Bench presents a comprehensive evaluation framework for voice agents that simulates realistic conversations and measures performance across multiple voice-specific failure modes using novel accuracy and experience metrics.

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
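The abstract contrasts pass@k (peak capability) with pass^k (reliable capability) but does not say how they are estimated. A minimal sketch, assuming the common combinatorial estimators over n repeated trials of one scenario with c successes; the function names `pass_at_k` and `pass_hat_k` are illustrative and not taken from the EVA-Bench code:

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn without replacement
    from n recorded trials (c of them successful), is a success."""
    _check(n, c, k)
    # math.comb returns 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn without replacement from the
    same n recorded trials, are successes (written pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Hypothetical scenario: 5 trials, 3 successes.
n, c = 5, 3
print(pass_at_k(n, c, 1))   # single-trial success rate
print(pass_at_k(n, c, 3))   # at least one success in 3 trials
print(pass_hat_k(n, c, 3))  # all 3 trials succeed
```

With these made-up numbers, at least one of three trials always succeeds while all three succeed only one time in ten, which is the kind of peak-versus-reliable gap the abstract reports. Whether EVA-Bench uses exactly these estimators is an assumption.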

Community

Paper submitter

How do you know a voice agent is good? Task completion isn't enough. A voice agent can call the correct tools and still misread a confirmation code, fabricate a policy detail, or respond so slowly a caller hangs up. Catching those failures requires evaluation that goes beyond transcripts, and beyond a single domain or acoustic condition.
Today, we're releasing EVA-Bench, designed to surface exactly that.
🏢 Three enterprise domains. We've scaled from a single dataset to three: HR, ITSM, and CSM. Because the best voice agent for customer service isn't necessarily the best one for HR or IT support.

Paper submitter

If you prefer the video/audio modality, please check out the podcast about this work: https://www.youtube.com/watch?v=x7Ks932T18o

the most interesting detail here is how EVA-A and EVA-X knit together end-to-end evaluation with two composite scores that cover both accuracy and experience across architectures. EVA-A aggregates task completion, faithfulness, and audio fidelity, while EVA-X targets progression, spoken conciseness, and turn-taking timing, which makes the end-to-end signal richer than typical transcript-based checks. the automatic simulator validation gate, which regenerates conversations when drift or quality issues are detected, is a clever guardrail, but i wonder how much it shifts the distribution of failure modes versus real-world user drift. a helpful ablation would be to disable the validation regeneration and compare score distributions, to see how much the gate itself shapes the benchmark. the arXivLens breakdown helped me parse the method details and complements EVA-Bench nicely, especially for tracing the multi-turn dynamics that the metrics hinge on (https://arxivlens.com/PaperView/Details/eva-bench-a-new-end-to-end-framework-for-evaluating-voice-agents-7211-04160d05).

Paper author

thanks for your comment! yeah, it's a great point about the regeneration. in our simulations, more than half of the conversations that get regenerated are due to timeouts and other simulation failures between the voice agent and user simulator. these are artifacts of our simulation environment and likely would not happen in a production setting. the remaining regenerations are typically due to the user simulator drifting from its instructions, which makes it tough to compare against real-world users. removing this gate would likely lower task completion scores. however, failures would likely stem from one of two causes: the conversation stalling due to a pipeline issue, or the user not following the instructions, preventing it from reaching the ground truth outcome.

when the simulation is regenerated due to user instruction following issues, it's often due to the user calling the end_call tool too soon and hanging up before the agent has gotten a chance to finish the task. this often happens in situations where the agent says "ok i will handle that now" or something like that and seems likely to call the tools to handle the task on the next turn, but the user goes ahead and ends the call on that turn (though we instruct it to wait for confirmation that the agent has actually completed the task). or the agent says "ok do you want me to finalize x" and the user says "yes" and then ends the call in the same turn (though again it is instructed not to do so).
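The two regeneration causes described in this reply (simulation timeouts and the user simulator hanging up too early) could be sketched roughly as below. This is an illustrative sketch under stated assumptions, not EVA-Bench's implementation: `Turn`, `Conversation`, `task_confirmed_at`, `validation_error`, `simulate_with_gate`, and the retry limit are all hypothetical; only the `end_call` tool name comes from the comment above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    speaker: str                       # "user" or "agent"
    text: str
    tool_calls: List[str] = field(default_factory=list)


@dataclass
class Conversation:
    turns: List[Turn]
    timed_out: bool = False            # hypothetical flag set by the simulator
    # Hypothetical: index of the agent turn that confirmed task completion.
    task_confirmed_at: Optional[int] = None


def validation_error(conv: Conversation) -> Optional[str]:
    """Return a reason to regenerate, or None if the run can be scored."""
    if conv.timed_out:
        return "simulation_timeout"
    for i, turn in enumerate(conv.turns):
        if turn.speaker == "user" and "end_call" in turn.tool_calls:
            # The user hung up before the agent confirmed completion.
            if conv.task_confirmed_at is None or conv.task_confirmed_at > i:
                return "premature_end_call"
            break
    return None


def simulate_with_gate(
    run_once: Callable[[], Conversation], max_attempts: int = 3
) -> Tuple[Optional[Conversation], List[str]]:
    """Re-run the simulation until one conversation passes validation."""
    rejected: List[str] = []
    for _ in range(max_attempts):
        conv = run_once()
        reason = validation_error(conv)
        if reason is None:
            return conv, rejected
        rejected.append(reason)
    return None, rejected              # no valid conversation within the budget
```

Returning the list of rejection reasons alongside the accepted conversation would make it possible to report how often each cause triggers a regeneration, which is the kind of breakdown the reply gives informally.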
