
arxiv:2512.16899

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Published on Dec 18, 2025 · Submitted by Yushi Hu on Dec 19, 2025

Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
Abstract

AI-generated summary: Multimodal RewardBench 2 (MMRB2) is a benchmark for reward models on multimodal understanding and generation tasks, featuring expert-annotated preferences and state-of-the-art model evaluations.

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best-performing open-source model, Qwen3-VL-32B, achieves accuracy similar to Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success under Best-of-N sampling, and we conduct an in-depth analysis that highlights key areas for improving reward models going forward.
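
For readers new to reward-model evaluation, the sketch below illustrates the two measurements the abstract references: a judge's pairwise accuracy against expert preference labels, and Best-of-N sampling driven by the same judge. It is a minimal, generic illustration rather than the paper's actual protocol; `PreferencePair`, `query_judge`, and every other name here are hypothetical placeholders.

```python
# Minimal sketch of pairwise reward-model evaluation and Best-of-N
# selection. Generic illustration only, not the MMRB2 protocol;
# query_judge is a hypothetical stand-in for a real multimodal judge.
import random
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str        # task prompt (may reference input images)
    response_a: str    # candidate response (text and/or image refs)
    response_b: str
    human_label: str   # "A" or "B": the expert-consensus preference

def query_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical judge call returning 'A' or 'B'. In practice this
    would query a multimodal LLM with both candidate responses."""
    return random.choice(["A", "B"])  # placeholder verdict

def judge_accuracy(pairs: list[PreferencePair]) -> float:
    """Fraction of pairs where the judge agrees with the human label.
    Randomizing presentation order guards against position bias."""
    correct = 0
    for pair in pairs:
        if random.random() < 0.5:  # swap A/B half the time
            verdict = query_judge(pair.prompt, pair.response_b, pair.response_a)
            verdict = {"A": "B", "B": "A"}[verdict]  # undo the swap
        else:
            verdict = query_judge(pair.prompt, pair.response_a, pair.response_b)
        correct += verdict == pair.human_label
    return correct / len(pairs)

def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Best-of-N via pairwise knockout: keep the judge-preferred
    response at each step. A judge that scores higher on the benchmark
    should select better responses here, which is the correlation
    the paper reports."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if query_judge(prompt, best, challenger) == "B":
            best = challenger
    return best
```

In this framing, the accuracies the abstract reports (e.g., 59% for GPT-4o vs. >90% for humans) are what `judge_accuracy` would compute over the 1,000 expert-annotated preference pairs per task.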

Community

Paper submitter

Reward models are the most critical part of post-training for omni models like nano banana, but they are barely studied in the open-source world. To lay the foundation for future research on better post-training and RL for omni models, FAIR at Meta Superintelligence Labs has released this reward benchmark.

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/multimodal-rewardbench-2-evaluating-omni-reward-models-for-interleaved-text-and-image-4504-324ab751

  • Key Findings
  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper 1

Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 2