ProBench: Benchmarking Large Language Models in Competitive Programming
Authors: Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong
Published: February 28, 2025 · arXiv: 2502.20868
AI-generated summary

ProBench benchmarks advanced LLMs in competitive programming using real test results, identifying specialized reasoning models as superior and highlighting areas for improvement.

Abstract
With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging,
large language models (LLMs) have entered a new phase of development. However,
existing benchmarks for coding evaluation are increasingly inadequate for assessing
the capability of advanced LLMs in code reasoning. To bridge this gap in
high-level code reasoning assessment, we propose ProBench to benchmark LLMs in
competitive programming, drawing inspiration from the International Collegiate
Programming Contest. ProBench collects a comprehensive set of competitive
programming problems from the Codeforces, Luogu, and Nowcoder platforms between
July and December 2024, obtaining real test results through online
submissions to ensure the fairness and accuracy of the evaluation. We establish
a unified problem attribute system, including difficulty grading and algorithm
tagging. With carefully collected and annotated data in ProBench, we
systematically assess 9 of the latest LLMs in competitive programming across multiple
dimensions, including thought chain analysis, error type diagnosis, and
reasoning depth evaluation. Experimental results show that QwQ-32B-Preview
achieves the best score of 20.93, followed by DeepSeek-V3 with a score of 16.38,
suggesting that models trained on specialized reasoning tasks significantly
outperform general-purpose models (even those larger than the reasoning-oriented models)
in programming. Further analysis also reveals key areas for programming
capability enhancement, e.g., algorithm adaptability and reasoning sufficiency,
providing important insights for the future development of reasoning models.
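
As a rough illustration of the unified problem attribute system (difficulty grading and algorithm tagging) and the submission-based scoring described in the abstract, the sketch below shows how such problem records and a solve-rate metric might be represented. All field names, the "Accepted" verdict string, and the percentage-style metric are assumptions for illustration; they are not ProBench's actual schema or scoring formula.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a ProBench-style problem record and scoring routine.
# Field names and the metric are illustrative assumptions, not the paper's schema.

@dataclass
class Problem:
    platform: str            # e.g. "Codeforces", "Luogu", "Nowcoder"
    problem_id: str          # platform-specific identifier
    difficulty: str          # unified difficulty grade (assumed label set)
    algorithm_tags: list[str] = field(default_factory=list)  # e.g. ["dp", "graphs"]

@dataclass
class Submission:
    problem: Problem
    model: str               # name of the evaluated LLM
    verdict: str             # verdict returned by the online judge, e.g. "Accepted"

def solve_rate(submissions: list[Submission], model: str) -> float:
    """Percentage of attempted problems a model solved (assumed metric)."""
    attempts = [s for s in submissions if s.model == model]
    if not attempts:
        return 0.0
    solved = sum(s.verdict == "Accepted" for s in attempts)
    return 100.0 * solved / len(attempts)
```

Filtering such records by `difficulty` or `algorithm_tags` would also support the kind of per-dimension analysis (e.g., algorithm adaptability) the abstract mentions, though the paper's actual implementation may differ.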