Paper page - VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

arxiv:2512.16501

VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Published on Dec 18, 2025
· Submitted by
Zhangxuan Gu
on Dec 19, 2025
Authors:
Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen

Abstract

AI-generated summary: A comprehensive, bilingual GUI grounding benchmark spanning multiple platforms is introduced, featuring a large-scale dataset with diverse UI elements and a hierarchical task taxonomy for evaluating models across basic and advanced grounding capabilities.

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data; (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks; and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
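
The grounding task the abstract describes is straightforward to evaluate: given an instruction and a screenshot, a model predicts a target location, and the prediction counts as correct if it hits the annotated element. Below is a minimal sketch of such a scorer, assuming the standard point-in-bounding-box criterion and hypothetical subtask labels; the paper's exact metric, data format, and subtask names are not specified on this page.

```python
from collections import defaultdict

# Hypothetical record format: one annotated element per example, tagged with
# the subtask it belongs to (the taxonomy's basic/advanced categories).
# Bounding boxes are (left, top, right, bottom) in pixels.

def point_in_bbox(point, bbox):
    """Return True if the predicted (x, y) point falls inside the box."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(examples, predictions):
    """Aggregate point-in-bbox accuracy overall and per subtask.

    examples:    list of dicts with "id", "bbox", and "subtask" keys.
    predictions: dict mapping example id -> predicted (x, y) click point.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        pred = predictions.get(ex["id"])
        correct = pred is not None and point_in_bbox(pred, ex["bbox"])
        for key in ("overall", ex["subtask"]):
            totals[key] += 1
            hits[key] += int(correct)
    return {k: hits[k] / totals[k] for k in totals}

# Toy usage with made-up data: two examples from different subtasks.
examples = [
    {"id": "ex1", "bbox": (100, 40, 180, 72), "subtask": "basic/element"},
    {"id": "ex2", "bbox": (300, 200, 420, 236), "subtask": "advanced/layout"},
]
predictions = {"ex1": (140, 55), "ex2": (500, 210)}  # second click misses
print(grounding_accuracy(examples, predictions))
# {'overall': 0.5, 'basic/element': 1.0, 'advanced/layout': 0.0}
```

Reporting accuracy per subtask, as in this sketch, is what makes the paper's basic-vs-advanced comparison possible: an overall score alone would hide the gap the authors highlight between general-purpose and GUI-specialized models.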

Community

Paper author Paper submitter

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/venusbench-gd-a-comprehensive-multi-platform-gui-benchmark-for-diverse-grounding-tasks-1573-031154db

  • Key Findings
  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2512.16501 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2512.16501 in a Space README.md to link it from this page.

Collections including this paper 1