Paper page - LLaVA-o1: Let Vision Language Models Reason Step-by-Step

arxiv:2411.10440

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Published on Nov 15, 2024 · Submitted by Guowei Xu on Nov 17, 2024
#1 Paper of the day

Abstract

Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.

AI-generated summary

LLaVA-o1, a novel Vision-Language Model, enhances reasoning capabilities through a structured multistage approach, outperforming larger models on multimodal reasoning benchmarks using a limited dataset and efficient inference-time scaling.
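To make the abstract's "inference-time stage-level beam search" concrete, here is a minimal sketch of the idea: generate several candidates for each reasoning stage, keep the best one, and only then move on to the next stage. The stage names follow the abstract (summarization, visual interpretation, reasoning, conclusion), but the helper functions `generate_stage` and `score` and the default beam width are hypothetical placeholders, not the authors' released implementation.

```python
# Hypothetical sketch of stage-level beam search, not the authors' released code.
# A generic scoring hook stands in for the paper's candidate-selection step.
from typing import Callable, List

# Stage order as described in the abstract: summarization, visual interpretation
# (caption), logical reasoning, conclusion generation.
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def stage_level_beam_search(
    prompt: str,
    generate_stage: Callable[[str, str, int], List[str]],  # (context, stage, n) -> n candidates
    score: Callable[[str, str], float],                     # (context, candidate) -> quality score
    beam_width: int = 4,
) -> str:
    context = prompt
    for stage in STAGES:
        # Sample beam_width candidate completions for this stage only.
        candidates = generate_stage(context, stage, beam_width)
        # Commit the best candidate before generating the next stage.
        best = max(candidates, key=lambda c: score(context, c))
        context += best
    return context
```

Compared with ordinary best-of-N sampling over the full answer, scaling at the stage level lets a weak early stage be discarded before it contaminates the later reasoning.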

Community

Paper author Paper submitter

In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning, in the spirit of OpenAI's o1. Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The key is training on structured data and a novel inference-time scaling method: stage-level beam search.

model: https://huggingface.co/Xkev/Llama-3.2V-11B-cot
Gradio: https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot
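As a quick-start illustration, the released checkpoint can be loaded like any Llama 3.2 Vision finetune with transformers. This is a hedged sketch: it assumes a transformers version with Mllama support, and the image path, prompt, and generation settings are placeholders rather than the paper's exact setup.

```python
# Illustrative only: load the released checkpoint with Hugging Face transformers.
# Assumes transformers >= 4.45 (Mllama support); prompt and settings are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many objects are on the table? Answer step by step."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

# The model is trained to emit staged output (summary, caption, reasoning, conclusion);
# in the project repo these appear as tags such as <SUMMARY>...</SUMMARY> and <CONCLUSION>...</CONCLUSION>.
output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))
```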


Congrats! It would be great to upload the model; here is the guide: https://huggingface.co/docs/hub/models-uploading
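For anyone following the linked guide, a minimal sketch of the upload flow with `huggingface_hub`; the repo name and local folder path are placeholders.

```python
# Minimal sketch: push a local checkpoint folder to the Hugging Face Hub.
# Assumes you are already authenticated (e.g. via `huggingface-cli login`).
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/Llama-3.2V-11B-cot", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./checkpoints/llava-o1",          # local directory with weights and config
    repo_id="your-username/Llama-3.2V-11B-cot",    # placeholder repo name
    repo_type="model",
)
```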

github: https://github.com/PKU-YuanGroup/LLaVA-o1

This comment has been hidden

Paper summary is here: https://www.aimodels.fyi/papers/arxiv/llava-o1-let-vision-language-models-reason

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (https://huggingface.co/papers/2411.04282) (2024)
* Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning (https://huggingface.co/papers/2410.05928) (2024)
* Vision-Language Models Can Self-Improve Reasoning via Reflection (https://huggingface.co/papers/2411.00855) (2024)
* Large Language Models Can Self-Improve in Long-context Reasoning (https://huggingface.co/papers/2411.08147) (2024)
* Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark (https://huggingface.co/papers/2410.14702) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 2

Datasets citing this paper: 5

Spaces citing this paper: 4

Collections including this paper: 40