Paper page - ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
Authors: Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, Ruocheng Guo
Organization: Intuit
Published: 2026-01-18
AI-generated summary: ToolPRMBench is introduced as a large-scale benchmark for evaluating process reward models in tool-using agents, featuring step-level test cases and multi-LLM verification to ensure data quality.
Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by guiding sampling and exploration over complex action spaces. At their core, these search methods use process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, systematic and reliable evaluation benchmarks for PRMs in tool-using settings are lacking. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We use offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. A multi-LLM verification pipeline is proposed to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool use. Code and data will be released at https://github.com/David-Li0406/ToolPRMBench.
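To make the step-level case format concrete, here is a minimal Python sketch of what one test case and a simple evaluation loop might look like. Everything in it is an assumption for illustration: the `StepCase` field names, the `pairwise_accuracy` metric (the abstract does not state which metric the benchmark reports), and the `toy_scorer` stand-in for a PRM are hypothetical and are not taken from the ToolPRMBench code or data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepCase:
    """One step-level test case. Field names are hypothetical."""
    history: List[str]        # interaction history up to this step
    correct_action: str       # the action labeled correct
    incorrect_action: str     # plausible but incorrect alternative
    tool_metadata: Dict[str, str] = field(default_factory=dict)


# A scorer maps (history, candidate action, tool metadata) to a step-level reward.
Scorer = Callable[[List[str], str, Dict[str, str]], float]


def pairwise_accuracy(cases: List[StepCase], score: Scorer) -> float:
    """Fraction of cases where the correct action strictly outscores the incorrect one.

    This metric is an assumption, not necessarily the one used by the benchmark.
    """
    if not cases:
        return 0.0
    wins = sum(
        score(c.history, c.correct_action, c.tool_metadata)
        > score(c.history, c.incorrect_action, c.tool_metadata)
        for c in cases
    )
    return wins / len(cases)


def toy_scorer(history: List[str], action: str, tool_metadata: Dict[str, str]) -> float:
    """Stand-in for a PRM: gives reward 1.0 only if the called tool name is known."""
    tool_name = action.split("(", 1)[0]
    return 1.0 if tool_name in tool_metadata else 0.0


# Two made-up cases: one with a misspelled tool name, one with a wrong argument.
cases = [
    StepCase(
        history=["user: what is the weather in Paris?"],
        correct_action="get_weather(city='Paris')",
        incorrect_action="get_wether(city='Paris')",
        tool_metadata={"get_weather": "Return current weather for a city."},
    ),
    StepCase(
        history=["user: book a table for two"],
        correct_action="book_table(party_size=2)",
        incorrect_action="book_table(party_size=20)",
        tool_metadata={"book_table": "Reserve a restaurant table."},
    ),
]

accuracy = pairwise_accuracy(cases, toy_scorer)
print(accuracy)
```

The toy scorer catches the misspelled tool name but cannot tell the two `book_table` calls apart, which is the kind of gap a step-level benchmark is meant to expose in a real PRM.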