ProBench: Benchmarking Large Language Models in Competitive Programming
Authors: Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong
Published: February 28, 2025 · arXiv: 2502.20868
AI-generated summary

ProBench benchmarks advanced LLMs in competitive programming using real test results, identifying specialized reasoning models as superior and highlighting areas for improvement.

Abstract
With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging,
large language models (LLMs) have entered a new phase of development. However,
existing benchmarks for coding evaluation are increasingly inadequate for assessing
the capability of advanced LLMs in code reasoning. To bridge this gap in
high-level code reasoning assessment, we propose ProBench to benchmark LLMs in
competitive programming, drawing inspiration from the International Collegiate
Programming Contest. ProBench collects a comprehensive set of competitive
programming problems from the Codeforces, Luogu, and Nowcoder platforms between
July and December 2024, obtaining real test results through online
submissions to ensure the fairness and accuracy of the evaluation. We establish
a unified problem attribute system, including difficulty grading and algorithm
tagging. With carefully collected and annotated data in ProBench, we
systematically assess 9 of the latest LLMs in competitive programming across multiple
dimensions, including thought chain analysis, error type diagnosis, and
reasoning depth evaluation. Experimental results show that QwQ-32B-Preview
achieves the best score of 20.93, followed by DeepSeek-V3 with a score of 16.38,
suggesting that models trained on specialized reasoning tasks significantly
outperform general-purpose models (even those larger than the reasoning-oriented models)
in programming. Further analysis also reveals key areas for programming
capability enhancement, e.g., algorithm adaptability and reasoning sufficiency,
providing important insights for the future development of reasoning models.
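
As a rough illustration of the unified problem attribute system (difficulty grading and algorithm tagging) and the submission-based scoring described in the abstract, the sketch below shows how such problem records and a solve-rate metric might be represented. All field names, the "Accepted" verdict string, and the percentage-style metric are assumptions for illustration; they are not ProBench's actual schema or scoring formula.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a ProBench-style problem record and scoring routine.
# Field names and the metric are illustrative assumptions, not the paper's schema.

@dataclass
class Problem:
    platform: str            # e.g. "Codeforces", "Luogu", "Nowcoder"
    problem_id: str          # platform-specific identifier
    difficulty: str          # unified difficulty grade (assumed label set)
    algorithm_tags: list[str] = field(default_factory=list)  # e.g. ["dp", "graphs"]

@dataclass
class Submission:
    problem: Problem
    model: str               # name of the evaluated LLM
    verdict: str             # verdict returned by the online judge, e.g. "Accepted"

def solve_rate(submissions: list[Submission], model: str) -> float:
    """Percentage of attempted problems a model solved (assumed metric)."""
    attempts = [s for s in submissions if s.model == model]
    if not attempts:
        return 0.0
    solved = sum(s.verdict == "Accepted" for s in attempts)
    return 100.0 * solved / len(attempts)
```

Filtering such records by `difficulty` or `algorithm_tags` would also support the kind of per-dimension analysis (e.g., algorithm adaptability) the abstract mentions, though the paper's actual implementation may differ.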