**lynazhang** (2025-01-17, reply):
Thank you all for your interest and support for our work! We are doing our best to complete the code review process quickly and aim to make the code public next week. If we are unable to release it by then, it will likely be made available after the Chinese New Year holidays.
**jack8841** (2025-01-09):
holy... shit?
**ajaiasai2022** (2025-01-09):
The GitHub link isn't working.
**lynazhang** (2025-01-09, reply):
As we are still undergoing the internal review process for open-source release, the repository remains private for now. Please stay tuned!
**weird-offspring** (2025-01-10, reply):
I'm eagerly waiting to train it on https://huggingface.co/datasets/weird-offspring/meta-thinking !
**TecnoWorld** (2025-01-11, reply):
> As we are still undergoing the internal review process for open-source release, the repository remains private for now. Please stay tuned!

Thanks. Please release the repository ASAP.
**duinamit** (2025-01-09):
Very impressive, I love the simplicity of using Q values as annotations! You mention 64 trajectories as some sort of saturation bound. Is that right, or have you just not tried scaling this approach further?
**lynazhang** (2025-01-09, reply):
Thank you! On challenging math benchmarks such as AIME, performance nearly saturates with 64 trajectories. For college math, performance continues to improve steadily; however, we did not scale beyond 64 due to the increased search cost. We believe AIME performance can be further improved by synthesizing additional Olympiad-level math problems to improve both the policy model and the process reward model. We leave this as future work.
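As an aside for readers, the kind of test-time scaling discussed here (sampling many trajectories and keeping the best-scored one) can be sketched as follows; `policy.sample` and `ppm.score` are assumed interfaces for illustration, not the authors' code:

```python
def best_of_n(question, policy, ppm, n=64):
    """Sample n candidate solution trajectories and keep the one
    the process reward model (PPM) scores highest."""
    best_traj, best_score = None, float("-inf")
    for _ in range(n):
        traj = policy.sample(question)      # one full step-by-step solution
        score = ppm.score(question, traj)   # reward-model score for the trajectory
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj
```

Accuracy typically improves with `n` until it saturates, which matches the behavior described above.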
**f0ster** (2025-01-09):
Thank you for sharing this work. I appreciate the blend of Monte Carlo Tree Search with smaller models to address step-by-step math reasoning. The idea of generating self-verified solutions rather than relying on a larger teacher model is promising, and it is good to see how you handle the complexity of code-based rollouts. I am curious how this approach might adapt to tasks that involve geometric proofs or more symbolic reasoning. It would also be interesting to learn about the practical limits when problems become highly intricate. Overall, this is a thoughtful piece of research, and I look forward to any future expansions into broader math domains.
**lynazhang** (2025-01-10, reply):
Thank you for your comments! We currently have limited experience with tasks involving more symbolic reasoning. However, based on our understanding, the MCTS-based approach can adapt well to such tasks. You might find AlphaGeometry (https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/) and DeepSeek-Prover-V1.5 (https://arxiv.org/abs/2408.08152) valuable references for exploring this direction further.
**mattcracker** (2025-01-09):
This is an incredibly impressive paper, and I'm very much looking forward to seeing the open-source code and the detailed development process.
**yandeng** (2025-01-09):
We created a deep-dive video for this paper: https://www.youtube.com/watch?v=cHgHS6Y3QP0
Love to hear your feedback!
**librarian-bot** (automated):
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:

- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation (2024)
- Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS (2024)
- BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving (2024)
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning (2024)
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions (2024)
- Preference Optimization for Reasoning with Pseudo Feedback (2024)
- Self-Generated Critiques Boost Reward Modeling for Language Models (2024)
**smahdavi4** (2025-01-10):
Very interesting work! I was curious whether there is any section addressing data decontamination. From what I understand, NuminaMath may include a notable portion of problems from OlympiadBench and Omni-Math.
**lynazhang** (2025-01-10, reply):
Thank you for your question! Decontamination is indeed critical for ensuring unbiased model performance evaluation. We tried our best to address this, including problem matching to identify and remove contaminated training samples from the dataset. For most of our evaluation benchmarks, such as GSM8K, AIME, AMC, CollegeMath, and Gaokao, we did not find significant contamination. For MATH, OlympiadBench, and Omni-Math, we identified a few hundred potentially contaminated examples and removed them from the training set to maintain the integrity of our evaluations.
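The exact matching procedure is not detailed in this thread; a common decontamination approach is word n-gram overlap against benchmark problems. A minimal sketch under that assumption (not the authors' implementation):

```python
def ngrams(text, n=8):
    """Set of word-level n-grams, used for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_problems, benchmark_problems, n=8):
    """Drop any training problem that shares a word n-gram with a benchmark problem."""
    bench_grams = set()
    for p in benchmark_problems:
        bench_grams |= ngrams(p, n)
    return [p for p in train_problems if not (ngrams(p, n) & bench_grams)]
```

Lowering `n` makes the filter stricter (more near-matches removed) at the cost of more false positives.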
**liaoch** (2025-01-10):
This is like self-play to learn Go. It should be able to dramatically improve coding skills too.
**C0casio45** (2025-01-10):
Very impressive paper, congrats!
**Fertel** (2025-01-10):
This is very nice work. Is it possible to measure the original Qwen models with your PPM?
Could you clarify how trajectory counting works? For instance, I start with 64 trajectories, and all solutions have 10 steps. After each step, I retain only the 32 best paths, then split each of these 32 paths into two, resulting in 64 trajectories again. I repeat this process 10 times (since the solution has 10 steps). In this case, how many trajectories do I have in total? Is it just 64, or is it 64 + 32×10?
**lynazhang** (2025-01-12, reply):
Thank you for your questions! We tested Qwen2.5-72B-Instruct with our PPM. We did not use the math-specialized model, as it struggles to effectively follow instructions to generate our code-augmented CoT steps. When combined with our 7B PPM, the 72B general model achieves math performance comparable to Qwen2.5-Math-72B-Instruct + Qwen2.5-RM-72B (MATH500: 85.8 vs. 85.0 (ours); AIME: 36.7 vs. 36.7; OlympiadBench: 54.5 vs. 55.9). Note that math-specialized models typically outperform general-purpose models.
**lynazhang** (2025-01-12, reply):
For the second question, I'd like to clarify the MCTS generation process. The 64 trajectories are generated through 64 rollouts. For each rollout, we start at the root node (the question) and generate a step-by-step trace until reaching the terminal node (the answer step). Then, based on whether the terminal node's answer is correct, the reward score is backpropagated to update the Q value of every step node in the trajectory. The next rollout repeats this process.
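The rollout-and-backpropagation loop described in this reply can be sketched roughly as follows; the `Node` structure and the `expand` interface are illustrative assumptions, not the authors' implementation:

```python
class Node:
    def __init__(self, step, parent=None):
        self.step = step          # the reasoning step this node represents
        self.parent = parent
        self.q = 0.0              # running value estimate
        self.visits = 0

def rollout(root, expand, is_correct, max_depth=16):
    """One MCTS rollout: descend from the question node to a terminal answer
    step, then backpropagate a +1/-1 reward along the visited path."""
    node, path = root, [root]
    for _ in range(max_depth):
        node = expand(node)       # choose or generate the next step node
        path.append(node)
        if node.step.startswith("answer:"):
            break                 # reached the terminal (answer) step
    reward = 1.0 if is_correct(node.step) else -1.0
    for n in path:                # update Q of every node on the trajectory
        n.visits += 1
        n.q += (reward - n.q) / n.visits   # incremental mean of rewards
    return reward
```

Running this loop 64 times from the same root produces the 64 trajectories discussed above, with each rollout refining the Q values used by subsequent rollouts.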
**gustavkeppler** (2025-01-10):
Very nice work! I have a question regarding the rollouts of the first round with the bootstrap DeepSeek model. You mentioned that you generate 5 candidate nodes; let's assume all of them produce running Python code. Then you do the rollout and backprop of the first node (as all candidates have 0 visits, UCT is infinity) and get a 1 or -1 for the Q value. Then you do the same for the other 4 candidate nodes, as they all have 0 visits and therefore the highest UCT value.

This means that just with the candidates of the very first step, you use 5 of the 8 rollouts, and therefore you are not going very deep? Or am I missing something here?
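For context, the UCT selection rule referenced in this question, in its standard form (the exploration constant `c` here is a generic choice, not a value from the paper):

```python
import math

def uct(node_q, node_visits, parent_visits, c=2.0):
    """Upper Confidence bound for Trees: unvisited children score infinity,
    so every candidate is tried at least once before any is revisited."""
    if node_visits == 0:
        return float("inf")
    return node_q + c * math.sqrt(math.log(parent_visits) / node_visits)
```

The infinite score for unvisited nodes is exactly why the question arises: selection will not revisit a node until all of its siblings have been expanded once.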
**lynazhang** (2025-01-12, reply):
Good question! It will go very deep (currently we set a maximum depth of 16). Rollout and backpropagation occur only at the terminal node. For each rollout, we start at the root node (the question) and generate a step-by-step trace until reaching the terminal node (the answer step). At the terminal node, we evaluate whether the answer is correct and backpropagate the reward score to update the Q value of every step node in the trajectory.
gustavkeppler: Thanks for the explanation! As far as I understand, you generate 1 full trajectory with up to 16 steps, and then do the backprop. But this generates a tree with only 1 child per parent. In the paper you mentioned that there are 5 candidate children. I don't understand at which points the tree is splitting. I would be very happy if you could clarify this for me!
ironbar: I agree, the paper is amazing, but the MCTS parametrization is confusing. I would love further clarification.

In appendix A.1 it is said that 16 MCTS rollouts are done per problem, and at each step 8 candidate nodes are explored. That number of rollouts seems wrong; you cannot estimate the value of the nodes correctly with such a low number. Maybe it is 16 MCTS rollouts per step? Or per node?

Thanks @lynazhang!
lynazhang: Thank you for your question! @gustavkeppler @ironbar Let me clarify:

If we generate 5 candidate nodes per step, starting from the root node (the question), the policy model generates 5 candidates for the first step. We then use the PPM to assign initial Q-values to these nodes. If the PPM is unavailable, all Q-values default to 0, equivalent to random selection. Based on the UCT score, we pick the node with the highest score as the first step.

From this selected node, the model generates another 5 candidates for the second step, and this process repeats until reaching the terminal answer node. The Q-value of the terminal node, based on whether the answer is correct, is backpropagated to update all nodes in the trajectory. This entire process is referred to as one rollout.

For the second rollout, we repeat the process. Notably, for the first step, we don't generate new candidates but instead select the highest-scoring node from the existing ones (because the 5 candidate nodes have already been generated in the first rollout). If this selected node differs from the one chosen in the first rollout, we generate 5 new candidates for the next step and continue the process as before.
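The rollout procedure described here can be sketched roughly as follows. This is an illustrative reconstruction, not the released implementation; names like `policy_candidates`, `ppm_score`, and `reward` are hypothetical stand-ins for the policy model, the PPM, and the final answer check.

```python
import math

C = 2.0          # exploration constant (the thread mentions c = 2)
BRANCH = 5       # candidate steps generated per node
MAX_DEPTH = 16   # maximum tree depth mentioned above

class Node:
    def __init__(self, state, prior_q=0.0):
        self.state = state      # partial solution text so far
        self.q = prior_q        # initialized from the PPM (0 if unavailable)
        self.visits = 0
        self.children = []

def uct(child, parent_visits):
    # Unvisited children get an unbounded bonus, so they are tried first.
    if child.visits == 0:
        return float("inf")
    return child.q + C * math.sqrt(math.log(parent_visits) / child.visits)

def rollout(root, policy_candidates, ppm_score, is_terminal, reward):
    """One rollout: descend root -> terminal, then backpropagate the reward."""
    path = [root]
    node = root
    for _ in range(MAX_DEPTH):
        if is_terminal(node.state):
            break
        if not node.children:  # expand only on the first visit to a node
            for step in policy_candidates(node.state, BRANCH):
                node.children.append(Node(step, prior_q=ppm_score(step)))
        node = max(node.children, key=lambda ch: uct(ch, max(node.visits, 1)))
        path.append(node)
    r = reward(node.state)  # e.g. +1 if the final answer is correct
    for n in path:          # update only the nodes along this trajectory
        n.visits += 1
        n.q += (r - n.q) / n.visits  # running mean of returns
    return r
```

Note how a second call to `rollout` on the same root reuses the already-generated first-step candidates instead of re-expanding, matching the explanation above.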
ironbar: Thanks for the detailed explanation!

I would recommend adding it to the paper, because as far as I know this is not the standard MCTS algorithm. In the standard MCTS rollout there is no expansion of nodes; you just make one move per node until you reach the terminal node.

Your custom MCTS would work very well once you have the PPM, but I believe the number of rollouts might be very small for the initialization of the PPM, so there might be room for improvement there.

In the paper, it is said that you use the exploration of round 2 to initialize the PPM. In that round 8 candidate nodes are explored at each step, and 16 rollouts are done per problem according to the paper. Exploration is promoted over exploitation with a constant `c` set to 2. That implies that in the first step each node might have been explored just 2 times. In my opinion, that is a very small number to estimate the value of the node. The problem gets worse because in step 2 there will also be 8 candidate nodes, but only 2 of the 8 would have been expanded with rollouts, and just one rollout each. And from step 3 it is very likely that only one of the eight nodes is explored, with just one rollout.

In my opinion, with 16 rollouts per problem the search tree is barely explored; most of the nodes won't have any information to estimate the Q value. Do you agree?
lynazhang: Thanks for the discussion! I'd like to clarify that we use a large c to encourage exploration during the initial rollouts. After several successful rollouts (i.e., obtaining the correct answer at the terminal step), the Q-values of intermediate nodes along these successful trajectories increase significantly. As a result, subsequent rollouts tend to focus more on exploitation. In our experiments, 16 rollouts strike a good balance between increased computational cost and Q-value effectiveness. You can find our code for more details (if everything goes smoothly, we plan to release it next week).
ironbar: Thanks, but I still think 16 is too low. As a reference, AlphaGo Zero does 1600 iterations per move, so that would be around 1.6e5 iterations per match (the average length of a game is 200 moves). There are 4 orders of magnitude of difference (1.6e1 vs 1.6e5).
@lynazhang Thanks again for the more detailed explanation of the first round of trajectory generation with the DeepSeek model. Your algorithm is not a typical MCTS, as @ironbar mentioned.

One more question about round 1 (without the PPM): after the first rollout, all nodes that lead to a right or wrong answer get updated Q values and are visited once. In the next rollout, since all other 4 children of the root node have 0 visits, the UCT algorithm will select one of those. In the third rollout, another one of the 3 left with 0 visits will be used. Therefore the first step, with 5 different possibilities, will use up 5 of the 8 rollouts, and only 3 rollouts are left for the variance of the second step. You are mainly doing a random variation of the first step. Am I correct with this?
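The selection pattern described in this question follows directly from the UCT formula: any sibling with zero visits gets an unbounded exploration bonus, so the early rollouts sweep across the root's candidates before any path deepens. A small numeric check, using a generic UCT formula (not necessarily the authors' exact variant):

```python
import math

def uct_score(q, child_visits, parent_visits, c=2.0):
    """Generic UCT: exploitation term q plus an exploration bonus."""
    if child_visits == 0:
        return float("inf")  # unvisited children are always preferred
    return q + c * math.sqrt(math.log(parent_visits) / child_visits)

# Root with 5 candidates after rollout 1: one child visited (Q = 1.0 from a
# correct answer), four still unvisited. UCT still picks an unvisited child.
scores = [uct_score(1.0, 1, 1)] + [uct_score(0.0, 0, 1)] * 4
picked = scores.index(max(scores))  # index 0 is the visited child
```

With ties among the unvisited siblings, which one is picked is implementation-dependent; the point is only that the visited child is not re-selected until all siblings have at least one visit.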
Regarding the Q value: instead of using rollout-traceback values as state values, you trained a PPM with a pair-wise contrastive loss. The training pairs (to train the PRM) are obtained by using the PRM itself, as well as the final answers' correctness to further filter them. Is this understanding correct? If correct, then how often do you update the PRM?

Thanks
lynazhang: Thank you for the question! As illustrated in Fig. 1(b), the training pairs are filtered from the MCTS tree (which is constructed using both the policy model and the PPM). The PRM (i.e., the PPM) is updated only once at the end of each self-evolution round, just like the policy model.
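The pair-wise objective being discussed can be sketched as a Bradley-Terry-style ranking loss over PPM scores. This is an illustrative stand-in for the actual PPM training code: it assumes positive steps are drawn from trajectories that reached a correct answer and negatives from incorrect ones, and it operates on plain score lists rather than model outputs.

```python
import math

def pairwise_ranking_loss(pos_scores, neg_scores):
    """Mean of -log sigmoid(s_pos - s_neg) over all positive/negative pairs.

    Driving this loss down pushes every positive step's score above every
    negative step's score, without needing exact step-level value labels.
    """
    total, n = 0.0, 0
    for sp in pos_scores:
        for sn in neg_scores:
            # -log sigmoid(sp - sn) == log(1 + exp(-(sp - sn)))
            total += math.log(1.0 + math.exp(-(sp - sn)))
            n += 1
    return total / n
```

When positives and negatives are well separated the loss approaches 0; when the scores are indistinguishable it sits at log 2 per pair.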
luckygyana: Check out a detailed paper explanation: https://gyanendradas.substack.com/p/rstar-paper-explained
aipapersacademy: Sharing a video & written summary: https://aipapersacademy.com/rstar-math/
Plopperzz: Is there an ETA on the GitHub repo going public?
Tudouni: Hello, why is this page 404 now: https://github.com/microsoft/rStar
zstanjj: Same problem.
dimitadi: The authors are working on releasing it this week; check the comments above.
lynazhang: Our code is now available at https://github.com/microsoft/rStar
dimitadi: Will you also make the model weights available?
swaraj56: Can anybody help me with the latest issue here: https://github.com/microsoft/rStar/issues/9
Also, for the training data, did they use the question and the full solution, or just the final answer? I'm not sure about that.

@librarian-bot recommend
avahal: arXiv lens breakdown of this paper: https://arxivlens.com/PaperView/Details/rstar-math-small-llms-can-master-math-reasoning-with-self-evolved-deep-thinking-259-c5ebf611
- Executive Summary
- Detailed Breakdown
- Practical Applications
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Abstract
rStar-Math enhances small language models' math reasoning capabilities through Monte Carlo Tree Search and self-evolution, achieving state-of-the-art performance on various benchmarks without distillation from larger models.
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
Community
We present rStar-Math to demonstrate that small language models (SLMs, 1.5B-7B) can rival or even surpass the math reasoning capability of OpenAI o1
Dear Lyna, I've sent you an email on Friday (ks@turingpost.com), could you please check it and let me know what you think? Many thanks! Ksenia
holy... shit?
As we are still undergoing the internal review process for open-source release, the repository remains private for now. Please stay tuned!
very impressive, I love the simplicity of using Q values as annotations! you mention 64 trajectories as some sort of saturation bound, is that right or have you just not tried scaling this approach even more?
Thank you! On challenging math benchmarks such as AIME, performance nearly saturates with 64 trajectories. For college math, performance continues to improve steadily; however, we did not scale beyond 64 due to the increased search cost. We believe AIME performance can be further improved by synthesizing additional Olympiad-level math problems to improve both the policy model and the process reward model. We leave this as our future work.
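Scaling the number of sampled trajectories, as discussed in this exchange, amounts to a best-of-N search. A minimal sketch of the final selection step, assuming each candidate trajectory carries a final answer and a PPM-style score (the pair layout here is a hypothetical simplification, not the authors' exact aggregation):

```python
from collections import defaultdict

def select_answer(trajectories):
    """Score-weighted voting over final answers.

    `trajectories` is a list of (final_answer, score) pairs, one per sampled
    trajectory. The answer whose supporting trajectories accumulate the
    highest total score wins, combining majority voting with reward scores.
    """
    weight = defaultdict(float)
    for answer, score in trajectories:
        weight[answer] += score
    return max(weight, key=weight.get)
```

With this kind of aggregation, going from a handful of trajectories to 64 mostly helps on hard problems where correct answers are rare; once the correct answer reliably dominates the weighted vote, extra samples add cost but little accuracy, which matches the saturation observed on AIME.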
Thank you for sharing this work. I appreciate the blend of Monte Carlo Tree Search with smaller models to address step-by-step math reasoning. The idea of generating self-verified solutions rather than relying on a larger teacher model is promising, and it is good to see how you handle the complexity of code-based rollouts. I am curious how this approach might adapt to tasks that involve geometric proofs or more symbolic reasoning. It would also be interesting to learn about the practical limits when problems become highly intricate. Overall, this is a thoughtful piece of research, and I look forward to any future expansions into broader math domains.
Thank you for your comments! We currently have limited experience with tasks involving more symbolic reasoning. However, based on our understanding, the MCTS-based approach can adapt well to such tasks. You might find AlphaGeometry (https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/) and DeepSeek-Prover1.5 (https://arxiv.org/abs/2408.08152) to be valuable references for exploring this direction further.
This is an incredibly impressive paper, and I'm very much looking forward to seeing the open-source code and the detailed development process.
We created a deep-dive video for this paper: https://www.youtube.com/watch?v=cHgHS6Y3QP0
Love to hear your feedback!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation (2024)
- Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS (2024)
- BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving (2024)
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning (2024)
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions (2024)
- Preference Optimization for Reasoning with Pseudo Feedback (2024)
- Self-Generated Critiques Boost Reward Modeling for Language Models (2024)
Very interesting work! I was curious if there is any section addressing data decontamination. From what I understand, Numina Math may include a notable portion of problems from OlympiadBench and Omni-Math.
Thank you for your question! Decontamination is indeed critical for ensuring unbiased model performance evaluation. We tried our best to address this, including problem matching to identify and remove contaminated training samples from the dataset. For most of our evaluation benchmarks, such as GSM8K, AIME, AMC, CollegeMath and Gaokao, we did not find significant contamination. For MATH, OlympiadBench and Omni-Math, we identified a few hundred potentially contaminated examples and removed them from the training set to maintain the integrity of our evaluations.
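A simple form of the problem matching described above can be sketched with n-gram overlap: a training problem is flagged if it shares too many n-grams with any benchmark problem. This is an illustrative heuristic; the n-gram length and threshold below are assumptions, not the paper's settings.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of word n-grams from lowercased, whitespace-split text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_problem: str, benchmark: list, n: int = 8,
                    thresh: float = 0.5) -> bool:
    """Flag a training problem whose n-gram overlap with any benchmark
    problem exceeds `thresh` (illustrative parameter choices)."""
    grams = ngrams(train_problem, n)
    if not grams:
        return False
    for bench_problem in benchmark:
        overlap = len(grams & ngrams(bench_problem, n)) / len(grams)
        if overlap > thresh:
            return True
    return False
```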
This is like self-play to learn Go. It should be able to dramatically improve coding skills too.
very impressive paper, congrats!
This is a very nice work. Is it possible to measure original Qwen models with your PPM?
Could you clarify how trajectory counting works? For instance, I start with 64 trajectories and all of the solutions have 10 steps. After each step, I retain only the 32 best paths. Then, I split each of these 32 paths into two, resulting in 64 trajectories again. I repeat this process 10 times (since the solution has 10 steps). In this case, how many trajectories do I have in total? Is it just 64, or is it 64 + 32×10?
Thank you for your questions! We tested Qwen2.5-72B-Instruct with our PPM. We used the general-purpose model rather than the math-specialized one, as the math-specialized model struggles to effectively follow instructions to generate our code-augmented CoT steps. When combined with our 7B PPM, the 72B general model achieves math performance comparable to Qwen2.5-Math-72B-Instruct + Qwen2.5-RM-72B (MATH500: 85.8 vs. 85.0 (ours); AIME: 36.7 vs. 36.7; Olympiad Bench: 54.5 vs. 55.9). Note that math-specialized models typically outperform general-purpose models.
Very nice work! I have a question regarding the rollouts of the first round with the bootstrap DeepSeek model. You mentioned that you generate 5 candidate nodes; let's assume all produce running Python code. Then you do the rollout and backpropagation of the first node (as all candidates have 0 visits, UCT is infinity) and get a 1 or -1 for the Q value. Then you do the same for the other 4 candidate nodes, as they all have 0 visits and therefore the highest UCT value.
This means that just with the candidates of the very first step, you spend 5 of the 8 rollouts and therefore do not go very deep? Or am I missing something here?
Good question! It will go very deep (currently we set a maximum depth=16). Rollout and backpropagation occur only at the terminal node. For each rollout, we start at the root node (the question) and generate a step-by-step trace until reaching the terminal node (the answer step). At the terminal node, we evaluate whether the answer is correct and backpropagate the reward score to update the Q value of every step node in the trajectory.
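The rollout-and-backpropagation procedure described above can be sketched as follows. This is a simplified illustration with hypothetical node fields, not the released code: a rollout reaches a terminal (answer) node, the answer is checked, and a reward of +1 or -1 is propagated back to update the Q value of every step node on the path.

```python
class Node:
    """One reasoning step in the MCTS tree (illustrative fields)."""
    def __init__(self, step, parent=None):
        self.step = step
        self.parent = parent
        self.children = []
        self.q = 0.0      # running mean of rollout rewards through this step
        self.visits = 0

def backpropagate(terminal: Node, reward: float) -> None:
    """Walk from the terminal node back to the root, updating each
    node's visit count and Q value (incremental mean of rewards)."""
    node = terminal
    while node is not None:
        node.visits += 1
        node.q += (reward - node.q) / node.visits
        node = node.parent

# Toy path: question -> step 1 -> answer; the rollout found a correct answer.
root = Node("question")
step1 = Node("step 1", parent=root)
answer = Node("answer", parent=step1)
backpropagate(answer, reward=1.0)
```

Because the update only happens at the terminal node, every rollout traces a full root-to-answer trajectory, which is why the search goes deep rather than stalling at the first layer of candidates.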
The Q value: instead of using values backpropagated from rollouts as the state value, you trained a PPM with a pairwise contrastive loss.
The training pairs (used to train the PRM) are obtained using the PRM itself, as well as the final answers' correctness for further filtering. Is this understanding correct?
If so, how often do you update the PRM?
Thanks
Thank you for the question! As illustrated in Fig. 1(b), the training pairs are filtered from the MCTS tree (which is constructed using both the policy model and the PPM). The PRM (i.e., the PPM) is updated only once at the end of each self-evolution round, just like the policy model.
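The pairwise training objective discussed in this exchange can be sketched as a Bradley-Terry style loss over (preferred, rejected) step pairs drawn from the MCTS tree. This is a minimal illustration of the general technique; the exact form of the paper's loss is not reproduced here.

```python
import math

def pairwise_loss(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(s_pos - s_neg).
    The loss shrinks as the preferred step outscores the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_pos - score_neg))))

# Larger score margins between preferred and rejected steps give lower loss.
assert pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0) < pairwise_loss(0.0, 0.5)
```

The appeal of this formulation is that it only needs a relative preference between two candidate steps (e.g. derived from their Q values), sidestepping noisy absolute step-level score annotation.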
Is there an ETA on the github going public?
same problem
Will you make the model weights also available?
Can anybody help me with the latest issue here: https://github.com/microsoft/rStar/issues/9
Also, for the training data, did they use the question and full solution, or just the final answer? I'm not sure about that.
arXiv lens breakdown of this paper: https://arxivlens.com/PaperView/Details/rstar-math-small-llms-can-master-math-reasoning-with-self-evolved-deep-thinking-259-c5ebf611
- Executive Summary
- Detailed Breakdown
- Practical Applications