
Code: https://github.com/SeanLeng1/CrossWordBench
Data: https://huggingface.co/datasets/HINT-lab/CrossWordBench
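As a quick-start reference, here is a minimal sketch (not taken from the official repository) of pulling the dataset above from the Hugging Face Hub with the `datasets` library; it discovers the available configurations and splits at runtime rather than assuming any particular names.

from datasets import get_dataset_config_names, load_dataset

# Minimal sketch: fetch CrossWordBench from the Hugging Face Hub and peek at its layout.
# Config and split names are discovered at runtime rather than assumed.
repo_id = "HINT-lab/CrossWordBench"

configs = get_dataset_config_names(repo_id)
print("available configs:", configs)

# Load the first configuration and report its splits, sizes, and column names.
ds = load_dataset(repo_id, configs[0])
for split_name, split in ds.items():
    print(split_name, len(split), list(split.features))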

\n","updatedAt":"2025-04-09T02:17:44.473Z","author":{"_id":"64efbf39b3610349e84db417","avatarUrl":"/avatars/9e09a20e88f8cf5ce119efc0dadc3b7b.svg","fullname":"Jiaxin Huang","name":"teapot123","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7781331539154053},"editors":["teapot123"],"editorAvatarUrls":["/avatars/9e09a20e88f8cf5ce119efc0dadc3b7b.svg"],"reactions":[],"isReport":false}},{"id":"67f72010dc096acda5b9473e","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-04-10T01:34:08.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems](https://huggingface.co/papers/2502.17848) (2025)\n* [AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models](https://huggingface.co/papers/2502.16906) (2025)\n* [TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning](https://huggingface.co/papers/2502.18431) (2025)\n* [MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems](https://huggingface.co/papers/2503.01891) (2025)\n* [FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving](https://huggingface.co/papers/2502.20238) (2025)\n* [Towards Reasoning Ability of Small Language Models](https://huggingface.co/papers/2502.11569) (2025)\n* [Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities](https://huggingface.co/papers/2502.11829) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-04-10T01:34:08.758Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7056304216384888},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.00043","authors":[{"_id":"67ec9d4ad327ed17ec707488","user":{"_id":"64811214cacb1c4a06988bd9","avatarUrl":"/avatars/2b0152ef26664c69b6652d44830ada77.svg","isPro":true,"fullname":"Jixuan Leng","user":"JixuanLeng","type":"user"},"name":"Jixuan Leng","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:47:14.691Z","hidden":false},{"_id":"67ec9d4ad327ed17ec707489","user":{"_id":"62ea79dd01ed9b0e8f61ccd3","avatarUrl":"/avatars/70af83e0e267be39fcd5f23b85e2dafa.svg","isPro":false,"fullname":"Chengsong Huang","user":"ChengsongHuang","type":"user"},"name":"Chengsong Huang","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:47:27.410Z","hidden":false},{"_id":"67ec9d4ad327ed17ec70748a","name":"Langlin Huang","hidden":false},{"_id":"67ec9d4ad327ed17ec70748b","user":{"_id":"607f666a4ad99100d63ce35c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/607f666a4ad99100d63ce35c/QxhxnvfeV6efkxwUFHwjI.png","isPro":false,"fullname":"Bill Yuchen Lin","user":"yuchenlin","type":"user"},"name":"Bill Yuchen Lin","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:47:44.967Z","hidden":false},{"_id":"67ec9d4ad327ed17ec70748c","user":{"_id":"66a28706d3449709d6943ef2","avatarUrl":"/avatars/804f45be500d40e1c0e973c691dd2c73.svg","isPro":false,"fullname":"William Cohen","user":"wwcohen","type":"user"},"name":"William W. Cohen","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:48:09.463Z","hidden":false},{"_id":"67ec9d4ad327ed17ec70748d","user":{"_id":"670d920206218f52e8ea376d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/j_-zv1ftLCJbjjw9-wnN8.png","isPro":false,"fullname":"Haohan Wang","user":"haohanw","type":"user"},"name":"Haohan Wang","status":"admin_assigned","statusLastChangedAt":"2025-04-09T16:48:23.506Z","hidden":false},{"_id":"67ec9d4ad327ed17ec70748e","name":"Jiaxin Huang","hidden":false}],"publishedAt":"2025-03-30T20:03:36.000Z","submittedOnDailyAt":"2025-04-09T00:47:44.460Z","title":"CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs\n with Controllable Puzzle Generation","submittedOnDailyBy":{"_id":"64efbf39b3610349e84db417","avatarUrl":"/avatars/9e09a20e88f8cf5ce119efc0dadc3b7b.svg","isPro":false,"fullname":"Jiaxin Huang","user":"teapot123","type":"user"},"summary":"Existing reasoning evaluation frameworks for Large Language Models (LLMs) and\nLarge Vision-Language Models (LVLMs) predominantly either assess text-based\nreasoning or vision-language understanding capabilities, with limited dynamic\ninterplay between textual and visual constraints. 
To address this limitation,\nwe introduce CrossWordBench, a benchmark designed to evaluate the reasoning\ncapabilities of both LLMs and LVLMs through the medium of crossword puzzles-a\ntask requiring multimodal adherence to semantic constraints from text-based\nclues and intersectional constraints from visual grid structures.\nCrossWordBench leverages a controllable puzzle generation framework that\nproduces puzzles in multiple formats (text and image) and offers different\nevaluation strategies ranging from direct puzzle solving to interactive modes.\nOur extensive evaluation of over 20 models reveals that reasoning LLMs\noutperform non-reasoning models substantially by effectively leveraging\ncrossing-letter constraints. We further demonstrate that LVLMs struggle with\nthe task, showing a strong correlation between their puzzle-solving performance\nand grid-parsing accuracy. Our findings offer insights into the limitations of\nthe reasoning capabilities of current LLMs and LVLMs, and provide an effective\napproach for creating multimodal constrained tasks for future evaluations.","upvotes":10,"discussionId":"67ec9d4fd327ed17ec707598","githubRepo":"https://github.com/SeanLeng1/CrossWordBench","githubRepoAddedBy":"auto","ai_summary":"CrossWordBench evaluates reasoning capabilities of LLMs and LVLMs using crossword puzzles, highlighting differences in textual and visual constraint handling.","ai_keywords":["CrossWordBench","Large Language Models","Large Vision-Language Models","multimodal adherence","semantic constraints","visual grid structures","controllable puzzle generation","crossing-letter constraints","grid-parsing accuracy"],"githubStars":12},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64efbf39b3610349e84db417","avatarUrl":"/avatars/9e09a20e88f8cf5ce119efc0dadc3b7b.svg","isPro":false,"fullname":"Jiaxin Huang","user":"teapot123","type":"user"},{"_id":"62ea79dd01ed9b0e8f61ccd3","avatarUrl":"/avatars/70af83e0e267be39fcd5f23b85e2dafa.svg","isPro":false,"fullname":"Chengsong Huang","user":"ChengsongHuang","type":"user"},{"_id":"65e02d89574e5aa0e9ce3efa","avatarUrl":"/avatars/2ab152a10b21d81fb1defc726b8e951a.svg","isPro":false,"fullname":"Langlin Huang","user":"shrango","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"66446d2fe7ca43b97c6f41fe","avatarUrl":"/avatars/b9936fd8bce78160bfa362e26594001c.svg","isPro":false,"fullname":"Boyuan Chen","user":"BoyuanChen","type":"user"},{"_id":"64811214cacb1c4a06988bd9","avatarUrl":"/avatars/2b0152ef26664c69b6652d44830ada77.svg","isPro":true,"fullname":"Jixuan Leng","user":"JixuanLeng","type":"user"},{"_id":"651c80a26ba9ab9b9582c273","avatarUrl":"/avatars/e963452eafd21f517d800f2e58e0f918.svg","isPro":false,"fullname":"siyeng feng","user":"siyengfeng","type":"user"},{"_id":"663ccbff3a74a20189d4aa2e","avatarUrl":"/avatars/83a54455e0157480f65c498cd9057cf2.svg","isPro":false,"fullname":"Nguyen Van Thanh","user":"NguyenVanThanhHust","type":"user"},{"_id":"64d9d1e80f992cf1cea308bb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64d9d1e80f992cf1cea308bb/zosqW7iHDSxOA8SJtB7V-.jpeg","isPro":false,"fullname":"zhao 
xu","user":"zhaoxu98","type":"user"},{"_id":"66fa9f0edf4d7ebc64ba64b4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66fa9f0edf4d7ebc64ba64b4/j680v7qhjpk1R_rxcb-L8.jpeg","isPro":false,"fullname":"Daniel","user":"LighterDarkness","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2504.00043

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Published on Mar 30, 2025
Submitted by Jiaxin Huang on Apr 9, 2025
Authors: Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang

Abstract

AI-generated summary: CrossWordBench evaluates the reasoning capabilities of LLMs and LVLMs using crossword puzzles, highlighting differences in textual and visual constraint handling.

Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly assess either text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies, ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
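To make the crossing-letter constraints mentioned in the abstract concrete, the toy sketch below (an illustration, not code from the CrossWordBench repository) shows how intersecting across and down answers constrain each other: a candidate word can only be written into the grid if every cell it shares with an already-placed word carries the same letter.

from typing import Dict, Tuple

Grid = Dict[Tuple[int, int], str]  # (row, col) -> letter

def place(grid: Grid, word: str, row: int, col: int, across: bool) -> bool:
    """Write `word` into `grid` unless a crossing letter conflicts; return success."""
    cells = [(row, col + i) if across else (row + i, col) for i in range(len(word))]
    # Reject the word if any cell is already fixed to a different letter by a crossing word.
    if any(grid.get(cell, letter) != letter for cell, letter in zip(cells, word)):
        return False
    for cell, letter in zip(cells, word):
        grid[cell] = letter
    return True

grid: Grid = {}
print(place(grid, "CROSS", 0, 0, across=True))   # True: empty grid, 1-Across fits
print(place(grid, "CLUE", 0, 0, across=False))   # True: shares the 'C' at (0, 0)
print(place(grid, "WORD", 0, 4, across=False))   # False: needs 'W' where 'S' is already fixed

Solvers that exploit these shared cells can prune inconsistent candidates early; per the abstract, effectively leveraging such constraints is what separates reasoning models from non-reasoning ones on this benchmark.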

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems (https://huggingface.co/papers/2502.17848) (2025)
* AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models (https://huggingface.co/papers/2502.16906) (2025)
* TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning (https://huggingface.co/papers/2502.18431) (2025)
* MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems (https://huggingface.co/papers/2503.01891) (2025)
* FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving (https://huggingface.co/papers/2502.20238) (2025)
* Towards Reasoning Ability of Small Language Models (https://huggingface.co/papers/2502.11569) (2025)
* Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities (https://huggingface.co/papers/2502.11829) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.00043 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.00043 in a Space README.md to link it from this page.

Collections including this paper 1