https://github.com/dongyh20/Insight-V
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* [LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://huggingface.co/papers/2411.10440) (2024)
* [Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding](https://huggingface.co/papers/2411.04282) (2024)
* [Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus](https://huggingface.co/papers/2411.12498) (2024)
* [VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning](https://huggingface.co/papers/2410.22995) (2024)
* [Let's Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models](https://huggingface.co/papers/2410.21728) (2024)
* [ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom](https://huggingface.co/papers/2410.14138) (2024)
* [Proceedings of the First International Workshop on Next-Generation Language Models for Knowledge Representation and Reasoning (NeLaMKRR 2024)](https://huggingface.co/papers/2410.05339) (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Published on Nov 21, 2024
Abstract
Insight-V enhances multi-modal large language models through scalable reasoning data generation and a multi-agent system, achieving performance improvements in visual and perceptual tasks.
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored for vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline that enhances the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline: a progressive strategy that generates sufficiently long and diverse reasoning paths, and a multi-granularity assessment method that ensures data quality. We observe that directly supervising MLLMs with such long and complex reasoning data does not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize the reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Built on the popular LLaVA-NeXT model and our stronger base MLLM, Insight-V delivers significant performance gains across challenging multi-modal benchmarks that require visual reasoning. Benefiting from the multi-agent system, Insight-V also maintains or improves performance on perception-focused multi-modal tasks.
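To make the division of labor in the multi-agent system concrete, the sketch below chains a reasoning agent and a summary agent at inference time. This is a minimal conceptual sketch, not the authors' implementation: the `Generate` callable, the prompt templates, and the function name `insight_v_style_inference` are illustrative assumptions. In Insight-V itself, both agents are fine-tuned MLLMs, and the reasoning agent is further refined with iterative DPO.

```python
# Conceptual sketch of a two-agent (reasoning + summary) inference flow.
# Hypothetical interface: each agent is a callable taking (image bytes, text prompt)
# and returning a text completion. Prompts and names are illustrative only.

from typing import Callable, Dict

Generate = Callable[[bytes, str], str]  # (image, prompt) -> model output

REASONING_PROMPT = (
    "Look at the image and reason step by step about the question below. "
    "Write out a detailed chain of thought before any conclusion.\n\n"
    "Question: {question}"
)

SUMMARY_PROMPT = (
    "You are given a question and a long reasoning trace produced by another model. "
    "Judge whether the reasoning supports a reliable answer, then give a concise final answer.\n\n"
    "Question: {question}\n\nReasoning trace:\n{trace}"
)


def insight_v_style_inference(
    image: bytes,
    question: str,
    reasoning_agent: Generate,
    summary_agent: Generate,
) -> Dict[str, str]:
    """Run the long-chain reasoning agent, then let the summary agent
    judge and condense its output into a final answer."""
    # Step 1: the reasoning agent produces a long, structured reasoning path.
    trace = reasoning_agent(image, REASONING_PROMPT.format(question=question))

    # Step 2: the summary agent assesses the trace and emits the final answer,
    # so an unreliable chain of thought does not have to dictate the output.
    answer = summary_agent(
        image, SUMMARY_PROMPT.format(question=question, trace=trace)
    )

    return {"reasoning": trace, "answer": answer}
```

One appeal of this decoupling is that the summary agent can discount an unhelpful reasoning trace and still answer directly, which is consistent with the abstract's claim that the multi-agent design preserves performance on perception-focused tasks.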