Paper page - VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
https://vitabench.github.io/

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries (https://huggingface.co/papers/2508.15760) (2025)
* How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench (https://huggingface.co/papers/2508.20931) (2025)
* OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows (https://huggingface.co/papers/2508.09124) (2025)
* MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use (https://huggingface.co/papers/2509.24002) (2025)
* OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks (https://huggingface.co/papers/2508.05614) (2025)
* TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant? (https://huggingface.co/papers/2509.22715) (2025)
* MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use (https://huggingface.co/papers/2508.16260) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
Authors: Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
Published: 2025-09-30 · arXiv: 2509.26490 · GitHub: https://github.com/meituan-longcat/vitabench
AI-generated summary

VitaBench is a benchmark for evaluating LLM-based agents in complex, real-world interactive tasks using a diverse set of tools and scenarios.

Abstract
As LLM-based agents are increasingly deployed in real-life scenarios,
existing benchmarks fail to capture the inherent complexity of these
settings: handling extensive information, leveraging diverse resources, and
managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging
benchmark that evaluates agents on versatile interactive tasks grounded in
real-world settings. Drawing from daily applications in food delivery, in-store
consumption, and online travel services, VitaBench presents agents with the
most complex life-serving simulation environment to date, comprising 66 tools.
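
To make the tool-based environment concrete, here is a minimal, hypothetical Python sketch of how one such simulation tool could be declared in the JSON-schema style common to LLM function calling. All names here (`ToolSpec`, `search_restaurants`) are invented for illustration and are not taken from the VitaBench codebase.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical tool wrapper; the actual VitaBench tool interface may differ.
@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: dict[str, Any]      # JSON-schema-style parameter spec
    handler: Callable[..., dict]    # executes against the simulated backend

def search_restaurants(city: str, cuisine: str | None = None) -> dict:
    """Toy handler for a simulated in-store consumption backend."""
    return {"results": [{"name": "Example Bistro", "city": city, "cuisine": cuisine}]}

SEARCH_RESTAURANTS = ToolSpec(
    name="search_restaurants",
    description="Find restaurants in a city, optionally filtered by cuisine.",
    parameters={
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "cuisine": {"type": "string"},
        },
        "required": ["city"],
    },
    handler=search_restaurants,
)

# Cross-scenario tasks would then draw tools from several such per-scenario
# registries (delivery, in-store, travel) into a single agent toolbox.
```
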
Through a framework that eliminates domain-specific policies, we enable
flexible composition of these scenarios and tools, yielding 100 cross-scenario
tasks (main results) and 300 single-scenario tasks. Each task is derived from
multiple real user requests and requires agents to reason across temporal and
spatial dimensions, utilize complex tool sets, proactively clarify ambiguous
instructions, and track shifting user intent throughout multi-turn
conversations. Moreover, we propose a rubric-based sliding window evaluator,
enabling robust assessment of diverse solution pathways in complex environments
and stochastic interactions. Our comprehensive evaluation reveals that even the
most advanced models achieve only a 30% success rate on cross-scenario tasks and
less than a 50% success rate on single-scenario tasks. Overall, we believe VitaBench will serve
as a valuable resource for advancing the development of AI agents in practical
real-world applications. The code, dataset, and leaderboard are available at
https://vitabench.github.io/
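
For readers curious how a rubric-based sliding window evaluator might work mechanically, below is a minimal Python sketch under stated assumptions: each task carries a list of rubric criteria, and a judge function checks each criterion against overlapping windows of conversation turns rather than the full transcript. The function names, signatures, and scoring rule are assumptions for illustration, not the paper's actual implementation.

```python
from typing import Callable

# Assumed judge interface: returns True if `criterion` is satisfied by the
# window text. In practice this would be an LLM call; here it is just a callable.
JudgeFn = Callable[[str, str], bool]

def sliding_window_eval(
    transcript: list[str],      # conversation turns plus tool calls, oldest first
    rubric: list[str],          # per-task rubric criteria (hypothetical format)
    judge: JudgeFn,
    window_size: int = 8,
    stride: int = 4,
) -> float:
    """Return the fraction of rubric criteria satisfied in at least one window."""
    satisfied: set[int] = set()
    for start in range(0, max(1, len(transcript) - window_size + 1), stride):
        window_text = "\n".join(transcript[start:start + window_size])
        for i, criterion in enumerate(rubric):
            if i not in satisfied and judge(criterion, window_text):
                satisfied.add(i)
    return len(satisfied) / len(rubric) if rubric else 1.0

# Example with a trivial keyword judge (a real evaluator would use an LLM judge):
if __name__ == "__main__":
    turns = ["user: book a table for two", "agent: which city?", "user: Shanghai",
             "agent: booked Example Bistro in Shanghai for two"]
    rubric = ["agent asks a clarifying question", "booking is confirmed"]
    keyword_judge = lambda c, w: ("?" in w) if "clarifying" in c else ("booked" in w)
    print(sliding_window_eval(turns, rubric, keyword_judge))  # -> 1.0
```

One appeal of this design is that windowing keeps judge inputs short and makes credit assignment robust to long, stochastic interactions: a criterion counts as satisfied if any window exhibits it, regardless of which of many valid solution pathways the agent took. Consult the VitaBench repository for the actual rubric format and judging protocol.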