ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
https://huggingface.co/spaces/WildEval/ZebraLogic
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* [VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning](https://huggingface.co/papers/2501.14540) (2025)
* [Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning](https://huggingface.co/papers/2412.09078) (2024)
* [A NotSo Simple Way to Beat Simple Bench](https://huggingface.co/papers/2412.12173) (2024)
* [Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers](https://huggingface.co/papers/2501.16961) (2025)
* [JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models](https://huggingface.co/papers/2501.14851) (2025)
* [Are Your LLMs Capable of Stable Reasoning?](https://huggingface.co/papers/2412.13147) (2024)
* [LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs](https://huggingface.co/papers/2501.06186) (2025)
Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2025-02-05T01:34:29.785Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7196462154388428},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2502.01100","authors":[{"_id":"67a1a649f4aecd0dfc96ebf4","user":{"_id":"607f666a4ad99100d63ce35c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/607f666a4ad99100d63ce35c/QxhxnvfeV6efkxwUFHwjI.png","isPro":false,"fullname":"Bill Yuchen Lin","user":"yuchenlin","type":"user"},"name":"Bill Yuchen Lin","status":"claimed_verified","statusLastChangedAt":"2025-02-04T09:39:17.972Z","hidden":false},{"_id":"67a1a649f4aecd0dfc96ebf5","user":{"_id":"635049104e753c9940fefd71","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/635049104e753c9940fefd71/HgR43XIFw3dneY5ufrAE8.jpeg","isPro":false,"fullname":"Ronan Le Bras","user":"ronanlb","type":"user"},"name":"Ronan Le Bras","status":"extracted_pending","statusLastChangedAt":"2025-02-04T05:31:56.722Z","hidden":false},{"_id":"67a1a649f4aecd0dfc96ebf6","user":{"_id":"659c6f5e1d398a238152227d","avatarUrl":"/avatars/dba97fe8cb825102f1eae97104a71f64.svg","isPro":false,"fullname":"Kyle Richardson","user":"yakazimir","type":"user"},"name":"Kyle Richardson","status":"claimed_verified","statusLastChangedAt":"2025-02-27T09:18:11.166Z","hidden":false},{"_id":"67a1a649f4aecd0dfc96ebf7","user":{"_id":"687e87f9b193d0f4c8d605c0","avatarUrl":"/avatars/1e2fa294cbb2017e380fae6695e35db1.svg","isPro":false,"fullname":"Ashish Sabharwal","user":"ashish333","type":"user"},"name":"Ashish Sabharwal","status":"claimed_verified","statusLastChangedAt":"2025-07-22T07:55:21.795Z","hidden":false},{"_id":"67a1a649f4aecd0dfc96ebf8","name":"Radha Poovendran","hidden":false},{"_id":"67a1a649f4aecd0dfc96ebf9","user":{"_id":"64d265cfbe712cda5ab7cc3f","avatarUrl":"/avatars/caab6fa5764a0271552ae589d352b592.svg","isPro":false,"fullname":"Peter Clarke","user":"PeterClarke","type":"user"},"name":"Peter Clark","status":"admin_assigned","statusLastChangedAt":"2025-02-04T17:38:40.973Z","hidden":false},{"_id":"67a1a649f4aecd0dfc96ebfa","user":{"_id":"64d42729f63b01b7f676b176","avatarUrl":"/avatars/52e54bdd6a1fb6c774a40cd70f3d7925.svg","isPro":false,"fullname":"Yejin Choi","user":"yejinchoinka","type":"user"},"name":"Yejin Choi","status":"admin_assigned","statusLastChangedAt":"2025-02-04T17:38:53.950Z","hidden":false}],"publishedAt":"2025-02-03T06:44:49.000Z","submittedOnDailyAt":"2025-02-04T03:02:03.929Z","title":"ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"We investigate the logical reasoning capabilities of large language models\n(LLMs) and their scalability in complex non-monotonic reasoning. 
Published on Feb 3, 2025

Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
AI-generated summary

LLMs exhibit decreased accuracy in complex non-monotonic reasoning, as demonstrated through the ZebraLogic framework assessing logic grid puzzles.

Abstract
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, OpenAI's o1, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty.
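To make the setup concrete, here is a minimal Python sketch of how a zebra-style logic grid puzzle reduces to a CSP with a quantifiable search space. The houses, names, and clues below are invented for illustration and are not drawn from the benchmark; the count assumes N houses and M attributes, each assigned by one permutation, giving a raw (pre-clue) search space of (N!)^M.

```python
# Minimal sketch (not the authors' code) of a zebra-style logic grid
# puzzle posed as a constraint satisfaction problem. With N houses and
# M attributes, each attribute is one permutation of N values, so the
# raw search space has (N!)^M assignments -- the kind of quantity a
# puzzle generator can dial up to control difficulty.
from itertools import permutations
from math import factorial

N = 3
names = ["Alice", "Bob", "Carol"]   # attribute 1 (hypothetical)
drinks = ["tea", "milk", "coffee"]  # attribute 2 (hypothetical)

print("search space size:", factorial(N) ** 2)  # (3!)^2 = 36

solutions = []
for name_perm in permutations(names):
    for drink_perm in permutations(drinks):
        who = {v: i for i, v in enumerate(name_perm)}   # person -> house index
        cup = {v: i for i, v in enumerate(drink_perm)}  # drink  -> house index
        # Toy clues, invented for this example:
        if (who["Alice"] < cup["milk"]           # Alice is left of the milk drinker
                and cup["coffee"] == who["Bob"]  # Bob drinks coffee
                and who["Carol"] != 0            # Carol is not in the first house
                and cup["milk"] == 1):           # milk is drunk in the middle house
            solutions.append((name_perm, drink_perm))

assert len(solutions) == 1  # a well-formed puzzle has a unique solution
print(solutions[0])         # (('Alice', 'Carol', 'Bob'), ('tea', 'milk', 'coffee'))
```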
Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.