Paper page - Training Data Efficiency in Multimodal Process Reward Models
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* [Towards Robust Process Reward Modeling via Noise-aware Learning](https://huggingface.co/papers/2601.12748) (2026)
* [Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning](https://huggingface.co/papers/2601.18984) (2026)
* [Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing](https://huggingface.co/papers/2602.03452) (2026)
* [Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization](https://huggingface.co/papers/2601.04992) (2026)
* [Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning](https://huggingface.co/papers/2601.21804) (2026)
* [Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement](https://huggingface.co/papers/2602.00815) (2026)
* [iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models](https://huggingface.co/papers/2601.05877) (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2026-02-06T01:41:28.563Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7096429467201233},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"69874bf72b5f695708321380","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false},"createdAt":"2026-02-07T14:28:07.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"arXivLens breakdown of this paper ๐ https://arxivlens.com/PaperView/Details/training-data-efficiency-in-multimodal-process-reward-models-5868-b7d47c63\n- Executive Summary\n- Detailed Breakdown\n- Practical Applications","html":"
\n","updatedAt":"2026-02-07T14:28:07.169Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7223247289657593},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.04145","authors":[{"_id":"698412bbe34659da7e1f4e04","user":{"_id":"64fc20d899123d7698a30e61","avatarUrl":"/avatars/9231982cf70a0689f50accedf1004702.svg","isPro":false,"fullname":"Jinyuan Li","user":"jinyuan222","type":"user"},"name":"Jinyuan Li","status":"claimed_verified","statusLastChangedAt":"2026-02-05T10:53:57.944Z","hidden":false},{"_id":"698412bbe34659da7e1f4e05","user":{"_id":"62ea79dd01ed9b0e8f61ccd3","avatarUrl":"/avatars/70af83e0e267be39fcd5f23b85e2dafa.svg","isPro":false,"fullname":"Chengsong Huang","user":"ChengsongHuang","type":"user"},"name":"Chengsong Huang","status":"claimed_verified","statusLastChangedAt":"2026-02-06T18:55:50.153Z","hidden":false},{"_id":"698412bbe34659da7e1f4e06","name":"Langlin Huang","hidden":false},{"_id":"698412bbe34659da7e1f4e07","name":"Shaoyang Xu","hidden":false},{"_id":"698412bbe34659da7e1f4e08","name":"Haolin Liu","hidden":false},{"_id":"698412bbe34659da7e1f4e09","name":"Wenxuan Zhang","hidden":false},{"_id":"698412bbe34659da7e1f4e0a","name":"Jiaxin Huang","hidden":false}],"publishedAt":"2026-02-04T02:27:38.000Z","submittedOnDailyAt":"2026-02-05T01:21:27.343Z","title":"Training Data Efficiency in Multimodal Process Reward Models","submittedOnDailyBy":{"_id":"65e02d89574e5aa0e9ce3efa","avatarUrl":"/avatars/2ab152a10b21d81fb1defc726b8e951a.svg","isPro":false,"fullname":"Langlin Huang","user":"shrango","type":"user"},"summary":"Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM training.Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora.To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass the full-data performance at small fractions. 
AI-generated summary

Training multimodal process reward models efficiently through balanced-information scoring that prioritizes label mixture and reliability while achieving full-data performance with only 10% of the training data.
Abstract

Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies data efficiency for MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: the label mixture of positive/negative steps and label reliability (the average MC score of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%. Our code is released at Balanced-Info-MPRM.
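Since the abstract does not spell out the scoring formula, the following is a minimal sketch of what rollout-level selection in the spirit of BIS might look like: each rollout is scored from its existing per-step MC annotations using a mixture term (how balanced the positive/negative step labels are) and a reliability term (the average MC score of the positive steps), and the top-scoring fraction of rollouts is kept for training. The 0.5 positive-step threshold, the product combination, and the helper names are illustrative assumptions, not the released Balanced-Info-MPRM implementation.

```python
# Illustrative sketch (not the paper's released code) of rollout-level data
# selection in the spirit of the Balanced-Information Score (BIS).
# Assumptions: each rollout is a list of per-step Monte Carlo scores in [0, 1];
# a step counts as "positive" when its MC score clears `pos_threshold`; the
# mixture term rewards a balance of positive and negative steps; and the
# reliability term is the average MC score of the positive steps. The exact
# thresholds and weighting here are hypothetical.

from typing import List, Sequence


def bis_score(mc_scores: Sequence[float], pos_threshold: float = 0.5) -> float:
    """Score one rollout from its per-step MC annotations."""
    positives = [s for s in mc_scores if s >= pos_threshold]

    # Mixture term: highest when positive and negative steps are balanced,
    # zero when the rollout contains only one label type.
    n = len(mc_scores)
    p = len(positives) / n if n else 0.0
    mixture = 4.0 * p * (1.0 - p)  # peaks at 1.0 when p = 0.5

    # Reliability term: average MC score of the positive steps
    # (how trustworthy the "correct step" labels look).
    reliability = sum(positives) / len(positives) if positives else 0.0

    return mixture * reliability


def select_subset(rollouts: List[List[float]], fraction: float = 0.1) -> List[int]:
    """Return indices of the top-scoring rollouts under the sketch score."""
    ranked = sorted(range(len(rollouts)),
                    key=lambda i: bis_score(rollouts[i]),
                    reverse=True)
    k = max(1, int(len(rollouts) * fraction))
    return ranked[:k]


if __name__ == "__main__":
    # Toy MC-annotated rollouts: each inner list is one reasoning trajectory.
    data = [
        [0.9, 0.8, 0.1, 0.0],    # mixed labels, reliable positives
        [0.95, 0.9, 0.85, 0.9],  # all positive: little gradient information
        [0.55, 0.6, 0.2, 0.1],   # mixed labels, weaker positives
    ]
    print(select_subset(data, fraction=0.34))  # -> [0] under these assumptions
```

Under these assumptions, rollouts that are all-positive (or all-negative) score zero on the mixture term and are deprioritized, while balanced rollouts with high-confidence positive steps rise to the top, matching the abstract's claim that informative updates depend on both label mixture and label reliability.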