
arxiv:2503.20271

ViLBench: A Suite for Vision-Language Process Reward Modeling

Published on Mar 26, 2025 · Submitted by Haoqin Tu on Mar 27, 2025

Abstract

AI-generated summary: Evaluation of process-supervised reward models on vision-language benchmarks reveals variability in performance and introduces a challenging benchmark, ViLBench, where current models show significant room for improvement.

Process-supervised reward models (PRMs) serve as a fine-grained function that provides detailed step-wise feedback on model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks. The results reveal that neither ORMs nor PRMs consistently outperform across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward examples with an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% over its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
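To make the ORM/PRM distinction above concrete, here is a minimal, hypothetical Python sketch of best-of-N selection, the setting in which a small reward model re-ranks candidate generations (as the paper does when its 3B model selects among OpenAI o1's outputs). This is not the ViLBench implementation; the `score_output` and `score_step` callables are placeholders for whatever trained reward heads are available.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """One candidate response, split into reasoning steps plus a final answer."""
    steps: List[str]
    final_answer: str


def orm_select(candidates: List[Candidate],
               score_output: Callable[[str], float]) -> Candidate:
    """Output reward model (ORM): score only the final answer of each candidate."""
    return max(candidates, key=lambda c: score_output(c.final_answer))


def prm_select(candidates: List[Candidate],
               score_step: Callable[[List[str], str], float]) -> Candidate:
    """Process reward model (PRM): score every intermediate step and aggregate.

    The mean step reward is used here; taking the minimum or the product of
    step rewards are common alternatives in the PRM literature.
    """
    def trajectory_score(c: Candidate) -> float:
        rewards = [score_step(c.steps[:i], step) for i, step in enumerate(c.steps)]
        return sum(rewards) / len(rewards) if rewards else 0.0

    return max(candidates, key=trajectory_score)


if __name__ == "__main__":
    # Toy reward functions standing in for trained ORM/PRM heads.
    toy_orm = lambda answer: float("27.3" in answer)
    toy_prm = lambda context, step: float("percent" in step.lower())

    candidates = [
        Candidate(steps=["Read the chart.", "Compute the percent change."],
                  final_answer="27.3"),
        Candidate(steps=["Guess a number."], final_answer="42"),
    ]
    print(orm_select(candidates, toy_orm).final_answer)  # best-of-N by final answer
    print(prm_select(candidates, toy_prm).final_answer)  # best-of-N by step rewards
```

In the paper's setting, the step-level scorer would correspond to the 3B model trained on the 73.6K tree-search-collected process reward examples, while the candidates come from a stronger generator.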

Community

Paper author Paper submitter

https://ucsc-vlaa.github.io/ViLBench/

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models (2025): https://huggingface.co/papers/2502.14191
* Demystifying Multilingual Chain-of-Thought in Process Reward Modeling (2025): https://huggingface.co/papers/2502.12663
* MM-RLHF: The Next Step Forward in Multimodal LLM Alignment (2025): https://huggingface.co/papers/2502.10391
* Process-based Self-Rewarding Language Models (2025): https://huggingface.co/papers/2503.03746
* Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models (2025): https://huggingface.co/papers/2501.18533
* Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning (2025): https://huggingface.co/papers/2502.14356
* R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization (2025): https://huggingface.co/papers/2503.12937

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2503.20271 in a model README.md to link it from this page.

Datasets citing this paper 2

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2503.20271 in a Space README.md to link it from this page.

Collections including this paper 6