Paper page - VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
https://github.com/THUDM/VisualAgentBench

This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* [LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments](https://huggingface.co/papers/2406.16294) (2024)
* [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://huggingface.co/papers/2407.01511) (2024)
* [GUICourse: From General Vision Language Models to Versatile GUI Agents](https://huggingface.co/papers/2406.11317) (2024)
* [ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights](https://huggingface.co/papers/2406.14596) (2024)
* [AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation](https://huggingface.co/papers/2408.00764) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
AI-generated summary

VisualAgentBench (VAB) is a benchmark designed to evaluate Large Multimodal Models (LMMs) across various real-world scenarios, including Embodied, Graphical User Interface, and Visual Design tasks, to assess their understanding and interaction capabilities.
Large Multimodal Models (LMMs) have ushered in a new era in artificial
intelligence, merging capabilities in both language and vision to form highly
capable Visual Foundation Agents. These agents are postulated to excel across a
myriad of tasks, potentially approaching general artificial intelligence.
However, existing benchmarks fail to sufficiently challenge or showcase the
full potential of LMMs in complex, real-world environments. To address this
gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering
benchmark specifically designed to train and evaluate LMMs as visual foundation
agents across diverse scenarios, including Embodied, Graphical User Interface,
and Visual Design, with tasks formulated to probe the depth of LMMs'
understanding and interaction capabilities. Through rigorous testing across
nine proprietary LMM APIs and eight open models, we demonstrate the
considerable yet still developing agent capabilities of these models.
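To make the evaluation protocol concrete, here is a minimal, hypothetical sketch of the agent-environment loop such a benchmark implies. The `Observation`, `run_episode`, and `evaluate` names and the `env`/`agent` interfaces are illustrative assumptions, not the actual VisualAgentBench API:

```python
# Hypothetical sketch of an agent-environment evaluation loop for a
# VAB-style benchmark; not the actual VisualAgentBench API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    screenshot: bytes   # rendered frame or GUI screenshot
    instruction: str    # natural-language task description


@dataclass
class EpisodeResult:
    success: bool
    steps: int
    actions: List[str] = field(default_factory=list)


def run_episode(env, agent, max_steps: int = 30) -> EpisodeResult:
    """Roll out one task: the agent sees a multimodal observation and
    emits a textual action until the episode ends or the budget runs out."""
    obs = env.reset()
    actions: List[str] = []
    for step in range(max_steps):
        action = agent.act(obs)            # e.g. "click(102, 334)" or "type('hello')"
        actions.append(action)
        obs, done, success = env.step(action)
        if done:
            return EpisodeResult(success=success, steps=step + 1, actions=actions)
    return EpisodeResult(success=False, steps=max_steps, actions=actions)


def evaluate(env_factory, agent, num_tasks: int) -> float:
    """Average task success rate over the benchmark's task set."""
    results = [run_episode(env_factory(task_id), agent) for task_id in range(num_tasks)]
    return sum(r.success for r in results) / max(num_tasks, 1)
```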
Additionally, VAB provides a trajectory training set constructed through hybrid methods, including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, enabling substantial performance improvements in LMMs through behavior cloning. Our work not only benchmarks existing models but also provides a solid foundation for future development of visual foundation agents. Code, train & test data, and some of the fine-tuned open LMMs are available at https://github.com/THUDM/VisualAgentBench.
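As a rough illustration of the behavior-cloning recipe mentioned above, the sketch below fine-tunes a causal language model on per-step (prompt, action) pairs flattened from collected trajectories, masking the loss to the action tokens. The trajectory schema, the text-only observations, and the Hugging-Face-style model/tokenizer interface are assumptions for illustration, not the paper's exact training setup:

```python
# Behavior-cloning sketch: each (observation, gold action) step in a
# trajectory becomes one supervised example; loss is computed only on
# the action tokens. Data schema and model interface are illustrative
# assumptions (text-only here; VAB itself uses visual observations).
import torch
from torch.utils.data import Dataset


class TrajectoryDataset(Dataset):
    """Flattens trajectories into per-step (prompt, action) examples."""

    def __init__(self, trajectories, tokenizer, max_len: int = 2048):
        self.examples = []
        for traj in trajectories:                      # traj: list of step dicts
            for step in traj:
                prompt = step["instruction"] + "\n" + step["observation_text"]
                self.examples.append((prompt, step["action"]))
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        prompt, action = self.examples[idx]
        enc = self.tokenizer(prompt + action, truncation=True,
                             max_length=self.max_len, return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        labels = input_ids.clone()
        # Supervise only the action tokens, not the prompt.
        prompt_len = self.tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
        labels[:prompt_len] = -100                     # -100 is ignored by the loss
        return input_ids, labels


def behavior_cloning_step(model, input_ids, labels, optimizer):
    """One gradient step of supervised fine-tuning (batch size 1 for brevity)."""
    out = model(input_ids=input_ids.unsqueeze(0), labels=labels.unsqueeze(0))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```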