LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Published: January 10, 2025 · Code: https://github.com/mbzuai-oryx/llamav-o1
Abstract
A framework for evaluating and improving step-by-step visual reasoning in large language models using a specialized benchmark and a novel multimodal model trained with curriculum learning.
Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large language models (LLMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges across eight categories, ranging from complex visual perception to scientific reasoning, with over 4k reasoning steps in total, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance than traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent LLaVA-CoT, our LlamaV-o1 achieves an average score of 67.3, an absolute gain of 3.8% across six benchmarks, while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
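To make the step-level metric concrete: the idea is to score each reasoning step individually rather than only the final answer. The sketch below is a minimal illustration under assumed simplifications, not the paper's actual metric; it aligns predicted steps to reference steps by position and uses a plain string-similarity ratio as a stand-in for the correctness-and-coherence judgment (a real implementation would more likely use a semantic or LLM-based judge). All names here are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's metric. Assumes positional
# alignment of predicted vs. reference steps and a string-similarity proxy
# for per-step correctness; missing steps score 0.
from difflib import SequenceMatcher


def step_score(predicted: str, reference: str) -> float:
    """Crude 0..1 similarity between one predicted step and its reference."""
    return SequenceMatcher(None, predicted.lower(), reference.lower()).ratio()


def stepwise_reasoning_score(predicted_steps: list[str],
                             reference_steps: list[str]) -> float:
    """Average per-step score over the reference reasoning chain."""
    if not reference_steps:
        return 0.0
    total = 0.0
    for i, ref in enumerate(reference_steps):
        pred = predicted_steps[i] if i < len(predicted_steps) else ""
        total += step_score(pred, ref)
    return total / len(reference_steps)


if __name__ == "__main__":
    reference = [
        "Read the heights of the two bars from the chart.",
        "Subtract the smaller value from the larger value.",
    ]
    predicted = [
        "Identify both bar heights in the chart.",
        "Compute the difference between the two values.",
    ]
    print(f"step-wise score: {stepwise_reasoning_score(predicted, reference):.2f}")
```

In this toy setup a model is rewarded for getting each intermediate step right, so two answers with the same final result but different reasoning quality receive different scores, which is the distinction the abstract draws against end-task accuracy metrics.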
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step (2024)
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (2024)
- TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action (2024)
- A NotSo Simple Way to Beat Simple Bench (2024)
- Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning (2024)
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning (2024)
- DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend