Papers
arxiv:2502.08680

Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges

Published on Feb 12, 2025 · Submitted by Minwu Kim on Feb 14, 2025

Authors: Safal Shrestha, Minwu Kim, Keith Ross

Code: https://github.com/minwukim/GSM-Ranges

Abstract

AI-generated summary: GSM-Ranges evaluates large language models' mathematical reasoning across diverse numerical scales using perturbed values, revealing weaknesses in handling out-of-distribution numbers and word problems.

Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, failing to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates (up to 14 percentage points) as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
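
As a rough illustration of the two ideas in the abstract, the sketch below shows, under stated assumptions, how a GSM-Ranges-style generator might perturb the numbers in a GSM8K word problem across magnitude levels, and how a grader might separate logical errors from purely computational (non-logical) ones. The range boundaries, function names, and grading rule are illustrative assumptions, not the paper's released code or exact methodology.

```python
# Hedged sketch, not the authors' code: the ranges, names, and grading rule
# below are illustrative assumptions.
import random
import re

# Hypothetical perturbation levels; the actual GSM-Ranges ranges may differ.
RANGES = {
    "original": None,                        # keep the problem's own numbers
    "small":    (1, 100),
    "medium":   (1_000, 100_000),
    "large":    (1_000_000, 1_000_000_000),
}

def perturb_numbers(problem: str, level: str, seed: int = 0) -> str:
    """Replace every integer in the problem text with a random integer drawn
    from the chosen magnitude range (identity for 'original')."""
    bounds = RANGES[level]
    if bounds is None:
        return problem
    rng = random.Random(seed)
    return re.sub(r"\d+", lambda m: str(rng.randint(*bounds)), problem)

def grade(model_expressions: list[str], model_answer: float, ground_truth: float) -> str:
    """Toy grading rule (an assumption, not the paper's exact methodology):
    re-evaluate the arithmetic expressions the model wrote down with exact
    Python arithmetic. If the corrected chain still reaches the ground truth,
    any mistake was non-logical (computational); otherwise the plan is wrong."""
    try:
        recomputed = [eval(expr, {"__builtins__": {}}) for expr in model_expressions]
    except Exception:
        return "logical_error"               # un-parseable reasoning chain
    if not recomputed or abs(recomputed[-1] - ground_truth) > 1e-6:
        return "logical_error"
    if abs(model_answer - ground_truth) > 1e-6:
        return "non_logical_error"           # right plan, wrong arithmetic
    return "correct"

# Example with a made-up GSM8K-style problem:
problem = "A farmer has 48 apples and sells 15 of them. How many apples remain?"
print(perturb_numbers(problem, "large"))
print(grade(["48 - 15"], model_answer=32, ground_truth=33))  # -> non_logical_error
```

The design point this sketch tries to capture is that grading a re-computed version of the model's own expression chain, rather than only the final answer, lets an evaluator tell a sound plan with a botched calculation apart from a genuinely flawed plan.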

Community

Paper author · Paper submitter

Abstract:

Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, failing to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates (up to 14 percentage points) as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework](https://huggingface.co/papers/2501.15581) (2025)
* [ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning](https://huggingface.co/papers/2502.01100) (2025)
* [UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models](https://huggingface.co/papers/2501.13766) (2025)
* [Advancing Reasoning in Large Language Models: Promising Methods and Approaches](https://huggingface.co/papers/2502.03671) (2025)
* [Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective](https://huggingface.co/papers/2501.11110) (2025)
* [Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers](https://huggingface.co/papers/2501.16961) (2025)
* [Large Language Models for Mathematical Analysis](https://huggingface.co/papers/2501.00059) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0


Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 3