
https://huggingface.co/spaces/WildEval/ZebraLogic

arxiv:2502.01100

ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Published on Feb 3, 2025 · Submitted by AK on Feb 4, 2025

Abstract

LLMs exhibit decreased accuracy in complex non-monotonic reasoning, as demonstrated through the ZebraLogic framework assessing logic grid puzzles.

AI-generated summary

We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
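The abstract frames each puzzle as a constraint satisfaction problem whose search space grows with puzzle size. As a rough illustration (a sketch, not code or data from the paper; the clues and names below are hypothetical), a tiny 3-house logic grid puzzle can be solved by exhaustive search over attribute permutations; with 2 attributes over 3 houses, the raw search space is (3!)^2 = 36 candidate assignments:

```python
from itertools import permutations

# Toy 3-house logic grid puzzle in the ZebraLogic style (hypothetical clues,
# not an actual benchmark instance). Each attribute is a permutation of
# values over houses, so the raw search space is (3!)^2 = 36 candidates.
NAMES = ("Alice", "Bob", "Carol")
DRINKS = ("tea", "coffee", "milk")

def solve():
    solutions = []
    for names in permutations(NAMES):
        for drinks in permutations(DRINKS):
            # Clue 1: Alice lives in house 1 (index 0).
            if names[0] != "Alice":
                continue
            # Clue 2: the coffee drinker lives immediately to the right of Bob.
            if drinks.index("coffee") != names.index("Bob") + 1:
                continue
            # Clue 3: Alice drinks tea.
            if drinks[names.index("Alice")] != "tea":
                continue
            solutions.append((names, drinks))
    return solutions

sols = solve()
assert len(sols) == 1  # a well-formed puzzle has exactly one solution
print(sols[0])
```

Exhaustive search is feasible only at this toy scale; the benchmark's point is that the search space (and the number of interacting constraints) can be dialed up until systematic search or inference, rather than pattern matching, is required.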

Community

Paper submitter

Librarian Bot

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning (2025): https://huggingface.co/papers/2501.14540
* Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning (2024): https://huggingface.co/papers/2412.09078
* A NotSo Simple Way to Beat Simple Bench (2024): https://huggingface.co/papers/2412.12173
* Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers (2025): https://huggingface.co/papers/2501.16961
* JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models (2025): https://huggingface.co/papers/2501.14851
* Are Your LLMs Capable of Stable Reasoning? (2024): https://huggingface.co/papers/2412.13147
* LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs (2025): https://huggingface.co/papers/2501.06186

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.01100 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 3

Collections including this paper 5