Paper page - Challenging the Boundaries of Reasoning: An Olympiad-Level Math
Benchmark for Large Language Models
AI-generated summary
OlymMATH is a new benchmark for evaluating the complex reasoning capabilities of LLMs using challenging Olympiad-level problems in parallel English and Chinese.
In recent years, the rapid development of large reasoning models has resulted
in the saturation of existing benchmarks for evaluating mathematical reasoning,
highlighting the urgent need for more challenging and rigorous evaluation
frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level
mathematical benchmark, designed to rigorously test the complex reasoning
capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each
manually verified and available in parallel English and Chinese versions. The
problems are systematically organized into two distinct difficulty tiers: (1)
AIME-level problems (easy) that establish a baseline for mathematical reasoning
assessment, and (2) significantly more challenging problems (hard) designed to
push the boundaries of current state-of-the-art models. In our benchmark, these
problems span four core mathematical fields, each including a verifiable
numerical solution to enable objective, rule-based evaluation. Empirical
results underscore the significant challenge presented by OlymMATH, with
state-of-the-art models including DeepSeek-R1 and OpenAI's o3-mini
demonstrating notably limited accuracy on the hard subset. Furthermore, the
benchmark facilitates comprehensive bilingual assessment of mathematical
reasoning abilities, a critical dimension that remains largely unaddressed in
mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark
at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
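Because every OlymMATH problem ships with a verifiable numerical answer, grading can be done with a simple rule-based check rather than an LLM judge. The snippet below is a minimal sketch of that idea, not the benchmark's official evaluation script; the answer-extraction heuristic, the tolerance value, and the fallback string comparison are all assumptions made for illustration.

```python
import re
from typing import Optional


def extract_final_answer(model_output: str) -> Optional[str]:
    """Pull the last \\boxed{...} expression from a model response, if any.

    Falls back to the last bare number in the text. This extraction heuristic
    is assumed for illustration and is not the benchmark's official parser.
    """
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return numbers[-1] if numbers else None


def is_correct(model_output: str, reference: str, tol: float = 1e-6) -> bool:
    """Rule-based check: compare the extracted answer to the reference numerically."""
    answer = extract_final_answer(model_output)
    if answer is None:
        return False
    try:
        return abs(float(answer) - float(reference)) <= tol
    except ValueError:
        # Non-numeric answers (e.g. "3/4") fall back to a whitespace-insensitive string match.
        return answer.replace(" ", "") == reference.replace(" ", "")


# Example usage with hypothetical model responses:
print(is_correct("... so the answer is \\boxed{42}.", "42"))                   # True
print(is_correct("The result is approximately 3.1416", "3.14159", tol=1e-3))  # True
```

A production evaluator would likely add symbolic comparison (e.g. via SymPy) for answers expressed as fractions or radicals, but the core idea is the same: objective, reproducible scoring without a judge model.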
We are happy to share our new mathematical benchmark, which is really challenging for reasoning models; even o3-mini demonstrates notably limited accuracy on it (~30%).
We hope this benchmark can better assess the reasoning abilities of LLMs.