Paper page - VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Demo Space: https://huggingface.co/spaces/MMInstruction/VL-RewardBench
Comment from Librarian Bot (automated): the following papers, recommended by the Semantic Scholar API, are similar to this paper.

* VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment (2024) - https://huggingface.co/papers/2410.09421
* MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models (2024) - https://huggingface.co/papers/2410.09733
* LLaVA-Critic: Learning to Evaluate Multimodal Models (2024) - https://huggingface.co/papers/2410.02712
* MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark (2024) - https://huggingface.co/papers/2410.11538
* Vision-Language Models Can Self-Improve Reasoning via Reflection (2024) - https://huggingface.co/papers/2411.00855
* MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models (2024) - https://huggingface.co/papers/2410.10139
* MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024) - https://huggingface.co/papers/2410.19168

Recommendations for any paper on Hugging Face are available from the librarian-bots recommend_similar_papers Space (https://huggingface.co/spaces/librarian-bots/recommend_similar_papers), or by tagging @librarian-bot recommend in a comment.
\n","updatedAt":"2024-11-28T01:34:17.448Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6833060383796692},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2411.17451","authors":[{"_id":"6746f9c205afcd8837d520b7","user":{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","isPro":false,"fullname":"Lei Li","user":"tobiaslee","type":"user"},"name":"Lei Li","status":"claimed_verified","statusLastChangedAt":"2024-12-09T14:47:54.952Z","hidden":false},{"_id":"6746f9c205afcd8837d520b8","name":"Yuancheng Wei","hidden":false},{"_id":"6746f9c205afcd8837d520b9","user":{"_id":"622f103fc78da4c7ebd7c887","avatarUrl":"/avatars/b0c7cd29835d92c2cd584947fcd5d520.svg","isPro":false,"fullname":"Xie","user":"Zhihui","type":"user"},"name":"Zhihui Xie","status":"claimed_verified","statusLastChangedAt":"2024-11-28T12:19:47.072Z","hidden":false},{"_id":"6746f9c205afcd8837d520ba","name":"Xuqing Yang","hidden":false},{"_id":"6746f9c205afcd8837d520bb","name":"Yifan Song","hidden":false},{"_id":"6746f9c205afcd8837d520bc","name":"Peiyi Wang","hidden":false},{"_id":"6746f9c205afcd8837d520bd","name":"Chenxin An","hidden":false},{"_id":"6746f9c205afcd8837d520be","name":"Tianyu Liu","hidden":false},{"_id":"6746f9c205afcd8837d520bf","name":"Sujian Li","hidden":false},{"_id":"6746f9c205afcd8837d520c0","name":"Bill Yuchen Lin","hidden":false},{"_id":"6746f9c205afcd8837d520c1","name":"Lingpeng Kong","hidden":false},{"_id":"6746f9c205afcd8837d520c2","name":"Qi Liu","hidden":false}],"publishedAt":"2024-11-26T14:08:34.000Z","submittedOnDailyAt":"2024-11-27T08:22:41.770Z","title":"VLRewardBench: A Challenging Benchmark for Vision-Language Generative\n Reward Models","submittedOnDailyBy":{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","isPro":false,"fullname":"Lei Li","user":"tobiaslee","type":"user"},"summary":"Vision-language generative reward models (VL-GenRMs) play a crucial role in\naligning and evaluating multimodal AI systems, yet their own evaluation remains\nunder-explored. Current assessment methods primarily rely on AI-annotated\npreference labels from traditional VL tasks, which can introduce biases and\noften fail to effectively challenge state-of-the-art models. To address these\nlimitations, we introduce VL-RewardBench, a comprehensive benchmark spanning\ngeneral multimodal queries, visual hallucination detection, and complex\nreasoning tasks. Through our AI-assisted annotation pipeline combining sample\nselection with human verification, we curate 1,250 high-quality examples\nspecifically designed to probe model limitations. 
Comprehensive evaluation\nacross 16 leading large vision-language models, demonstrates VL-RewardBench's\neffectiveness as a challenging testbed, where even GPT-4o achieves only 65.4%\naccuracy, and state-of-the-art open-source models such as Qwen2-VL-72B,\nstruggle to surpass random-guessing. Importantly, performance on VL-RewardBench\nstrongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N\nsampling with VL-GenRMs. Analysis experiments uncover three critical insights\nfor improving VL-GenRMs: (i) models predominantly fail at basic visual\nperception tasks rather than reasoning tasks; (ii) inference-time scaling\nbenefits vary dramatically by model capacity; and (iii) training VL-GenRMs to\nlearn to judge substantially boosts judgment capability (+14.7% accuracy for a\n7B VL-GenRM). We believe VL-RewardBench along with the experimental insights\nwill become a valuable resource for advancing VL-GenRMs.","upvotes":11,"discussionId":"6746f9c405afcd8837d52182","ai_summary":"VL-RewardBench is a comprehensive benchmark designed to challenge vision-language generative reward models across various tasks, demonstrating their limitations and providing insights for improvement.","ai_keywords":["vision-language generative reward models","multimodal AI systems","AI-annotated preference labels","visual hallucination detection","complex reasoning tasks","AI-assisted annotation pipeline","human verification","large vision-language models","GPT-4o","Qwen2-VL-72B","MMMU-Pro","Best-of-N sampling","inference-time scaling","training VL-GenRMs to learn to judge"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6038d6d0612f5eef3cc05ea9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6038d6d0612f5eef3cc05ea9/ryhvAX5djQpD5OrIlZQ1f.jpeg","isPro":false,"fullname":"Lei Li","user":"tobiaslee","type":"user"},{"_id":"622f103fc78da4c7ebd7c887","avatarUrl":"/avatars/b0c7cd29835d92c2cd584947fcd5d520.svg","isPro":false,"fullname":"Xie","user":"Zhihui","type":"user"},{"_id":"665d268988912c5ab6d332e4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/665d268988912c5ab6d332e4/s8hf6aM4A8yLTi7AHz46A.jpeg","isPro":false,"fullname":"Xuqing Yang","user":"catalpa-bungei","type":"user"},{"_id":"63f37af60be81bdc5d92eebb","avatarUrl":"/avatars/b8dfdff4ab36988ec9a8643e82a3d2db.svg","isPro":false,"fullname":"Huang","user":"Jinfa","type":"user"},{"_id":"63b2a92e18e5cf2cdd333492","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b2a92e18e5cf2cdd333492/GxnngJG0u7d0jYTEFOrfe.png","isPro":false,"fullname":"Jaehyun 
Jun","user":"btjhjeon","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"641b754d1911d3be6745cce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641b754d1911d3be6745cce9/Ydjcjd4VuNUGj5Cd4QHdB.png","isPro":false,"fullname":"atayloraerospace","user":"Taylor658","type":"user"},{"_id":"65decc75beffeb39ba679eba","avatarUrl":"/avatars/735b678bd5863a0c1b1bdd3bbf8858fa.svg","isPro":true,"fullname":"r","user":"oceansweep","type":"user"},{"_id":"64d4615cf8082bf19b916492","avatarUrl":"/avatars/8e1b59565ec5e4b31090cf1b911781b9.svg","isPro":false,"fullname":"wongyukim","user":"wongyukim","type":"user"},{"_id":"637aebed7ce76c3b834cea37","avatarUrl":"/avatars/78d6dd02d900e4a4b4fd89776b01f4fe.svg","isPro":false,"fullname":"RainingXY","user":"xxyyy123","type":"user"},{"_id":"67b19f81615a3737b5772b3b","avatarUrl":"/avatars/2ca4c53715e82a7e17c4b631eaf34042.svg","isPro":false,"fullname":"Fenzhif","user":"FANCERTA","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary: VL-RewardBench is a comprehensive benchmark designed to challenge vision-language generative reward models across various tasks, demonstrating their limitations and providing insights for improvement.
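The benchmark's headline numbers are pairwise judgment accuracies: a generative reward model is shown a query with two candidate responses and must pick the human-preferred one, so a random judge sits near 50%. Below is a minimal sketch of that scoring loop; the `judge_pair` callable and the example fields (`image`, `query`, `response_a`, `response_b`, `preferred`) are hypothetical placeholders rather than the paper's actual data schema or prompt format.

```python
from typing import Callable, Dict, List


def pairwise_accuracy(
    examples: List[Dict],
    judge_pair: Callable[[str, str, str, str], str],
) -> float:
    """Fraction of preference pairs where the judge picks the human-preferred response.

    Each example is assumed to carry an image path, a query, two candidate
    responses, and a human label 'a' or 'b'; a random judge lands near 0.5.
    """
    correct = 0
    for ex in examples:
        choice = judge_pair(ex["image"], ex["query"], ex["response_a"], ex["response_b"])
        correct += int(choice == ex["preferred"])
    return correct / len(examples)


# Toy usage with a trivially biased judge that always answers 'a'.
if __name__ == "__main__":
    data = [
        {"image": "img_0.png", "query": "What color is the car?",
         "response_a": "Red.", "response_b": "Blue.", "preferred": "a"},
        {"image": "img_1.png", "query": "How many dogs are in the picture?",
         "response_a": "Two.", "response_b": "Three.", "preferred": "b"},
    ]
    always_a = lambda image, query, resp_a, resp_b: "a"
    print(f"accuracy = {pairwise_accuracy(data, always_a):.2f}")  # 0.50 on this toy set
```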
Abstract

Vision-language generative reward models (VL-GenRMs) play a crucial role in
aligning and evaluating multimodal AI systems, yet their own evaluation remains
under-explored. Current assessment methods primarily rely on AI-annotated
preference labels from traditional VL tasks, which can introduce biases and
often fail to effectively challenge state-of-the-art models. To address these
limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning
general multimodal queries, visual hallucination detection, and complex
reasoning tasks. Through our AI-assisted annotation pipeline combining sample
selection with human verification, we curate 1,250 high-quality examples
specifically designed to probe model limitations. Comprehensive evaluation
across 16 leading large vision-language models demonstrates VL-RewardBench's
effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4%
accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B
struggle to surpass random guessing. Importantly, performance on VL-RewardBench
strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N
sampling with VL-GenRMs. Analysis experiments uncover three critical insights
for improving VL-GenRMs: (i) models predominantly fail at basic visual
perception tasks rather than reasoning tasks; (ii) inference-time scaling
benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to
learn to judge substantially boosts judgment capability (+14.7% accuracy for a
7B VL-GenRM). We believe VL-RewardBench, along with the experimental insights,
will become a valuable resource for advancing VL-GenRMs.
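The reported Pearson's r > 0.9 ties judge quality to downstream utility: each VL-GenRM is used to pick the best of N sampled answers on MMMU-Pro, and the resulting accuracy is correlated with that model's VL-RewardBench score. The sketch below illustrates the idea under stated assumptions; `generate_candidates` and `score_with_genrm` are hypothetical stand-ins for a candidate sampler and a reward-model scorer, not the paper's actual interfaces, and SciPy's `pearsonr` handles the correlation.

```python
from typing import Callable, List, Sequence

from scipy.stats import pearsonr


def best_of_n(
    query: str,
    generate_candidates: Callable[[str, int], List[str]],
    score_with_genrm: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate answers and keep the one the reward model scores highest."""
    candidates = generate_candidates(query, n)
    return max(candidates, key=lambda answer: score_with_genrm(query, answer))


def benchmark_downstream_correlation(
    rewardbench_scores: Sequence[float],   # per-model accuracy on the judge benchmark
    bon_task_accuracies: Sequence[float],  # per-model MMMU-Pro accuracy with Best-of-N
) -> float:
    """Pearson's r between judge-benchmark scores and Best-of-N downstream accuracy."""
    r, _ = pearsonr(rewardbench_scores, bon_task_accuracies)
    return float(r)


# Toy usage: four hypothetical models whose judge scores track their downstream gains,
# giving a correlation close to 1 (illustrative numbers, not results from the paper).
if __name__ == "__main__":
    print(benchmark_downstream_correlation([0.45, 0.52, 0.61, 0.65],
                                           [0.30, 0.34, 0.41, 0.44]))
```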