arxiv:2509.02522

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Published on Sep 2, 2025 · Submitted by Jiaming Li on Sep 3, 2025
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang

Abstract

PACS, a novel RLVR framework, reformulates RLVR as a supervised learning task, improving stability and efficiency in training large language models for reasoning tasks.

AI-generated summary

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address the challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLMs post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
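The reformulation described in the abstract can be illustrated with a minimal sketch, assuming the score function s_theta(x, y) is simply a scaled sum of the policy's token log-probabilities for a sampled response and that outcome rewards are binary labels in {0, 1}. The function name `pacs_style_loss` and the `beta` scale below are illustrative choices, not taken from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def pacs_style_loss(response_logprobs: torch.Tensor,
                    rewards: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    """Cross-entropy over a policy-parameterized score (sketch, not the official implementation).

    response_logprobs: (B,) summed log pi_theta(y | x) for each sampled response,
                       used here as the score s_theta(x, y).
    rewards:           (B,) verifiable outcome labels in {0, 1}.
    beta:              illustrative scale on the score.
    """
    scores = beta * response_logprobs   # score function parameterized by the policy
    targets = rewards.float()           # treat the outcome reward as a predictable label
    return F.binary_cross_entropy_with_logits(scores, targets)

# Example: three sampled responses, two verified correct.
logp = torch.tensor([-12.3, -8.1, -15.7], requires_grad=True)
r = torch.tensor([1.0, 1.0, 0.0])
pacs_style_loss(logp, r).backward()     # gradients flow only through the policy's log-probabilities
```

Under these assumptions, the gradient of this loss is (sigma(s_theta) - r) * grad_theta s_theta = -beta * (r - sigma(s_theta)) * grad_theta log pi_theta(y | x), i.e., a policy-gradient-style update in which (r - sigma(s_theta)) plays the role of an implicit advantage, consistent with the actor-critic coupling the abstract describes.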

Community


great work!

@librarian-bot recommend


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement (https://huggingface.co/papers/2508.00410) (2025)
- Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models (https://huggingface.co/papers/2508.05613) (2025)
- COPO: Consistency-Aware Policy Optimization (https://huggingface.co/papers/2508.04138) (2025)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (https://huggingface.co/papers/2508.00222) (2025)
- Libra: Assessing and Improving Reward Model by Learning to Think (https://huggingface.co/papers/2507.21645) (2025)
- Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR (https://huggingface.co/papers/2508.14029) (2025)
- TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference (https://huggingface.co/papers/2509.15110) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

@librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.02522 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.02522 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.02522 in a Space README.md to link it from this page.

Collections including this paper 8