Paper page - Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation
\n","updatedAt":"2026-01-02T03:56:00.103Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6759045124053955},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2512.23703","authors":[{"_id":"6953cdc209c8c0a5381814e1","user":{"_id":"66b5dc0b854ad316cf835ab4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66b5dc0b854ad316cf835ab4/8gOWw81rV5la7mzjw_qRv.jpeg","isPro":false,"fullname":"tanhuajie2001","user":"tanhuajie2001","type":"user"},"name":"Huajie Tan","status":"claimed_verified","statusLastChangedAt":"2025-12-31T20:55:02.091Z","hidden":false},{"_id":"6953cdc209c8c0a5381814e2","name":"Sixiang Chen","hidden":false},{"_id":"6953cdc209c8c0a5381814e3","user":{"_id":"6672dbb6bf11c2d4404ce64a","avatarUrl":"/avatars/655f3c4bf44850be9f3e73be3a64f353.svg","isPro":false,"fullname":"Yijie Xu","user":"YijieXuJoey","type":"user"},"name":"Yijie Xu","status":"claimed_verified","statusLastChangedAt":"2025-12-31T20:54:58.653Z","hidden":false},{"_id":"6953cdc209c8c0a5381814e4","user":{"_id":"66cb02bd315af068a95b4cd1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/ES7lfvsN-wF_r2fFRNEPx.png","isPro":false,"fullname":"zixiao_bios","user":"zixiao-bios","type":"user"},"name":"Zixiao Wang","status":"claimed_verified","statusLastChangedAt":"2025-12-31T20:54:55.887Z","hidden":false},{"_id":"6953cdc209c8c0a5381814e5","name":"Yuheng Ji","hidden":false},{"_id":"6953cdc209c8c0a5381814e6","name":"Cheng Chi","hidden":false},{"_id":"6953cdc209c8c0a5381814e7","name":"Yaoxu Lyu","hidden":false},{"_id":"6953cdc209c8c0a5381814e8","name":"Zhongxia Zhao","hidden":false},{"_id":"6953cdc209c8c0a5381814e9","name":"Xiansheng Chen","hidden":false},{"_id":"6953cdc209c8c0a5381814ea","name":"Peterson Co","hidden":false},{"_id":"6953cdc209c8c0a5381814eb","name":"Shaoxuan Xie","hidden":false},{"_id":"6953cdc209c8c0a5381814ec","name":"Guocai Yao","hidden":false},{"_id":"6953cdc209c8c0a5381814ed","name":"Pengwei Wang","hidden":false},{"_id":"6953cdc209c8c0a5381814ee","name":"Zhongyuan Wang","hidden":false},{"_id":"6953cdc209c8c0a5381814ef","name":"Shanghang Zhang","hidden":false}],"publishedAt":"2025-12-29T18:57:44.000Z","submittedOnDailyAt":"2025-12-30T10:35:25.428Z","title":"Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation","submittedOnDailyBy":{"_id":"66b5dc0b854ad316cf835ab4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66b5dc0b854ad316cf835ab4/8gOWw81rV5la7mzjw_qRv.jpeg","isPro":false,"fullname":"tanhuajie2001","user":"tanhuajie2001","type":"user"},"summary":"The primary obstacle for applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. 
To address these, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically-sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io","upvotes":7,"discussionId":"6953cdc209c8c0a5381814f0","projectPage":"https://robo-dopamine.github.io/","githubRepo":"https://github.com/FlagOpen/Robo-Dopamine","githubRepoAddedBy":"user","ai_summary":"A novel reward modeling approach called Dopamine-Reward addresses limitations in reinforcement learning for robotics by introducing a step-aware process reward model and theoretically sound reward shaping to improve policy learning efficiency and generalization.","ai_keywords":["Process Reward Models","step-aware understanding","multi-view perception","reward modeling","General Reward Model","Step-wise Reward Discretization","Multi-Perspective Reward Fusion","Policy-Invariant Reward Shaping","policy learning","reinforcement learning"],"githubStars":166,"organization":{"_id":"61be9739d2f9358e24ca0a4f","name":"BAAI","fullname":"Beijing Academy of Artificial Intelligence","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1664511063789-632c234f42c386ebd2710434.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66b5dc0b854ad316cf835ab4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66b5dc0b854ad316cf835ab4/8gOWw81rV5la7mzjw_qRv.jpeg","isPro":false,"fullname":"tanhuajie2001","user":"tanhuajie2001","type":"user"},{"_id":"6672dbb6bf11c2d4404ce64a","avatarUrl":"/avatars/655f3c4bf44850be9f3e73be3a64f353.svg","isPro":false,"fullname":"Yijie Xu","user":"YijieXuJoey","type":"user"},{"_id":"66cb02bd315af068a95b4cd1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/ES7lfvsN-wF_r2fFRNEPx.png","isPro":false,"fullname":"zixiao_bios","user":"zixiao-bios","type":"user"},{"_id":"63f08dc79cf89c9ed1bb89cd","avatarUrl":"/avatars/37290358ad00bbd752f519cfdec02f3e.svg","isPro":false,"fullname":"Zhoues","user":"Zhoues","type":"user"},{"_id":"6407e5294edf9f5c4fd32228","avatarUrl":"/avatars/8e2d55460e9fe9c426eb552baf4b2cb0.svg","isPro":false,"fullname":"Stoney 
Kang","user":"sikang99","type":"user"},{"_id":"668f5478b3991ac0c3fc9c2f","avatarUrl":"/avatars/a775853d3b88e7b1c8494ca837b5495c.svg","isPro":false,"fullname":"yuhengji","user":"yuheng2000","type":"user"},{"_id":"686db5d4af2b856fabbf13aa","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/6BjMv2LVNoqvbX8fQSTPI.png","isPro":false,"fullname":"V bbbb","user":"Bbbbbnnn","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"61be9739d2f9358e24ca0a4f","name":"BAAI","fullname":"Beijing Academy of Artificial Intelligence","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1664511063789-632c234f42c386ebd2710434.png"}}">
AI-generated summary
A novel reward modeling approach called Dopamine-Reward addresses limitations in reinforcement learning for robotics by introducing a step-aware process reward model and theoretically sound reward shaping to improve policy learning efficiency and generalization.

Abstract
The primary obstacle to applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently proposed learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these issues, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io
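To make the abstract's key ideas concrete, below is a minimal Python sketch of how a multi-view, step-discretized progress estimate could drive policy-invariant (potential-based) reward shaping. This is not the authors' implementation or API: the functions fuse_views, discretize, and shaped_reward, the averaging-based fusion, the 10-step grid, and the choice of potential-based shaping (the canonical policy-invariant scheme from Ng et al., 1999) are all illustrative assumptions; the paper's GRM, Multi-Perspective Reward Fusion, and Policy-Invariant Reward Shaping may differ in form.

```python
# Minimal sketch (not the authors' implementation): fuse per-view progress
# estimates from a process reward model, discretize them into sub-task steps,
# and apply potential-based shaping so the dense reward cannot change the
# optimal policy. All names here are hypothetical.
from typing import Sequence


def fuse_views(view_scores: Sequence[float]) -> float:
    """Fuse per-camera progress estimates into one scalar in [0, 1].

    A simple average stands in for the paper's Multi-Perspective Reward Fusion.
    """
    scores = [min(max(s, 0.0), 1.0) for s in view_scores]
    return sum(scores) / len(scores)


def discretize(progress: float, num_steps: int = 10) -> float:
    """Snap a continuous progress estimate onto a fixed grid of sub-task steps,
    a stand-in for Step-wise Reward Discretization."""
    return round(progress * num_steps) / num_steps


def shaped_reward(r_env: float, phi_s: float, phi_s_next: float,
                  gamma: float = 0.99) -> float:
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s).

    Using the (discretized, fused) progress as the potential densifies the
    learning signal while leaving the optimal policy unchanged.
    """
    return r_env + gamma * phi_s_next - phi_s


# Example: the sparse task reward is 0 until success; shaping still rewards
# measurable progress between consecutive states s and s'.
phi_s = discretize(fuse_views([0.42, 0.38, 0.45]))       # potential at s
phi_s_next = discretize(fuse_views([0.55, 0.49, 0.58]))  # potential at s'
r = shaped_reward(r_env=0.0, phi_s=phi_s, phi_s_next=phi_s_next)
print(f"shaped reward: {r:.3f}")
```

The reason potential-based shaping avoids the "semantic trap" mentioned in the abstract is that the shaping term telescopes along any trajectory, so the return of every policy shifts by a policy-independent constant and the optimal policy is preserved; how the paper realizes this property in its Policy-Invariant Reward Shaping is detailed in the full text.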