Paper page - The Lessons of Developing Process Reward Models in Mathematical Reasoning

    Papers
    arxiv:2501.07301

    The Lessons of Developing Process Reward Models in Mathematical Reasoning

    Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

    Published on Jan 13, 2025
    · Submitted by Chujie Zheng on Jan 14, 2025
    #1 Paper of the day

    Abstract

    AI-generated summary

    Process Reward Models improve performance in mathematical reasoning by integrating Monte Carlo estimation with LLM-as-a-judge, addressing biases in Best-of-N evaluation and step-wise error identification.

    Process Reward Models (PRMs) have emerged as a promising approach for process supervision in the mathematical reasoning of Large Language Models (LLMs), aiming to identify and mitigate intermediate errors in the reasoning process. However, developing effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that the commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. (2) The tolerance of PRMs toward such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
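    One way to picture the consensus filtering mechanism mentioned in the abstract is sketched below. This is only an illustrative reading, not the paper's implementation: a candidate training sample is retained when the MC estimate and the LLM judge agree on where (or whether) the first erroneous step occurs. All names, fields, and the exact agreement criterion are placeholders.

```python
# Illustrative sketch of consensus filtering for PRM training data (hypothetical
# names and criterion): keep a sample only when MC estimation and an LLM judge
# agree on the location of the first erroneous step.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedSolution:
    steps: List[str]
    mc_scores: List[float]             # MC estimate per step: fraction of rollouts reaching the correct answer
    judge_first_error: Optional[int]   # LLM-as-a-judge: index of first wrong step, None if all steps are correct


def mc_first_error(mc_scores: List[float], threshold: float = 0.0) -> Optional[int]:
    """Hard reading of the MC estimate: first step whose score falls to the threshold."""
    for i, score in enumerate(mc_scores):
        if score <= threshold:
            return i
    return None


def consensus_filter(samples: List[AnnotatedSolution]) -> List[AnnotatedSolution]:
    """Retain only samples where both annotation sources agree."""
    return [s for s in samples if mc_first_error(s.mc_scores) == s.judge_first_error]
```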

    Community

    Paper author Paper submitter
    edited Jan 14, 2025

    We share our practices and lessons on building process reward models (PRMs) for mathematical reasoning, and release two strong PRMs:

    • https://huggingface.co/Qwen/Qwen2.5-Math-PRM-7B
    • https://huggingface.co/Qwen/Qwen2.5-Math-PRM-72B

    [image]
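    For anyone who wants to try the released checkpoints, here is a minimal scoring sketch in Python. It assumes the usage pattern described on the model cards: reasoning steps separated by a special `<extra_0>` token, with per-step scores read from the model's token-level outputs at those positions. The separator token, the `trust_remote_code` requirement, and the output format are assumptions to verify against the model card.

```python
# Minimal sketch for scoring a step-by-step solution with Qwen2.5-Math-PRM-7B.
# Assumptions (verify against the model card): steps are separated by the special
# token "<extra_0>", the model loads with trust_remote_code=True, and the forward
# pass returns token-level two-class logits in outputs[0].
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-PRM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()

question = "Sue has 48 apples and gives away 1/3 of them. How many are left?"
steps = [
    "Sue gives away 48 * 1/3 = 16 apples.",
    "She has 48 - 16 = 32 apples left.",
    "The answer is 32.",
]

# Join the steps with the step-separator token and build a chat prompt.
messages = [
    {"role": "system", "content": "Please reason step by step."},
    {"role": "user", "content": question},
    {"role": "assistant", "content": "<extra_0>".join(steps) + "<extra_0>"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(input_ids=input_ids)[0]  # assumed shape: (1, seq_len, 2)

# Read the probability of the "correct" class at each separator position.
sep_id = tokenizer.encode("<extra_0>", add_special_tokens=False)[0]
sep_positions = (input_ids[0] == sep_id).nonzero(as_tuple=True)[0]
probs = torch.softmax(logits[0, sep_positions].float(), dim=-1)[:, 1]
print([round(p, 3) for p in probs.tolist()])  # one score per step
```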


    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

    The following papers were recommended by the Semantic Scholar API:

    • Entropy-Regularized Process Reward Model (2024): https://huggingface.co/papers/2412.11006
    • Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search (2025): https://huggingface.co/papers/2501.01478
    • Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning (2024): https://huggingface.co/papers/2412.15797
    • ProcessBench: Identifying Process Errors in Mathematical Reasoning (2024): https://huggingface.co/papers/2412.06559
    • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models (2025): https://huggingface.co/papers/2501.03124
    • PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment (2024): https://huggingface.co/papers/2411.11681
    • Outcome-Refining Process Supervision for Code Generation (2024): https://huggingface.co/papers/2412.15118

    Please give a thumbs up to this comment if you found it helpful!

    If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

    In the second line of "2.3 Evaluation Results", "Qwen2.5-Math-7B-PRM-MC-hard (trained with soft labels)" is not correct. That should actually be "Qwen2.5-Math-7B-PRM-MC-soft (trained with soft labels)", right?

    ·
    Paper author

    Yes, it is "Qwen2.5-Math-7B-PRM-MC-soft (trained with soft labels)". Sorry for the typo and we will fix it!
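    For context on the hard/soft naming in this thread: with MC estimation, a step's raw score is the fraction of sampled completions from that step that reach the correct final answer. A small sketch of the two label variants follows; the binarization rule shown for the hard label is one common choice and may differ from the paper's exact definition.

```python
# Hypothetical illustration of soft vs. hard MC-estimation labels for one step.
def mc_step_label(num_correct_rollouts: int, num_rollouts: int, hard: bool) -> float:
    soft = num_correct_rollouts / num_rollouts      # soft label: empirical success probability
    if hard:
        # hard label: binarized (here, 1 if any rollout reaches the correct answer)
        return 1.0 if num_correct_rollouts > 0 else 0.0
    return soft
```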



    Models citing this paper 3

    Datasets citing this paper 0

    No dataset linking this paper


    Spaces citing this paper 1

    Collections including this paper 29