Paper page - Intriguing Properties of Large Language and Vision Models

https://github.com/passing2961/IP-LLVM

\n","updatedAt":"2024-10-11T04:31:45.538Z","author":{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","fullname":"Young-Jun Lee","name":"passing2961","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"editors":["passing2961"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg"],"reactions":[],"isReport":false}},{"id":"670902d3a45f08b489356e8b","author":{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","fullname":"Young-Jun Lee","name":"passing2961","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"createdAt":"2024-10-11T10:49:55.000Z","type":"comment","data":{"edited":true,"hidden":false,"latest":{"raw":"@librarian-bot","html":"

\n\n@librarian-bot\n\t

\n","updatedAt":"2024-10-11T10:50:41.494Z","author":{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","fullname":"Young-Jun Lee","name":"passing2961","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":1,"editors":["passing2961"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg"],"reactions":[],"isReport":false}},{"id":"6709469a836c21a316596ebb","author":{"_id":"657152eb12f162153b50ec9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg","fullname":"Byung-Kwan Lee","name":"BK-Lee","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":65,"isUserFollowing":false},"createdAt":"2024-10-11T15:39:06.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Interesting work!","html":"

Interesting work!

\n","updatedAt":"2024-10-11T15:39:06.518Z","author":{"_id":"657152eb12f162153b50ec9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg","fullname":"Byung-Kwan Lee","name":"BK-Lee","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":65,"isUserFollowing":false}},"numEdits":0,"editors":["BK-Lee"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg"],"reactions":[],"isReport":false}},{"id":"6709d21b8fc328d66cf4aba7","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2024-10-12T01:34:19.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Phantom of Latent for Large Language and Vision Models](https://huggingface.co/papers/2409.14713) (2024)\n* [A Survey on Benchmarks of Multimodal Large Language Models](https://huggingface.co/papers/2408.08632) (2024)\n* [EMMA: Efficient Visual Alignment in Multi-Modal LLMs](https://huggingface.co/papers/2410.02080) (2024)\n* [ParGo: Bridging Vision-Language with Partial and Global Views](https://huggingface.co/papers/2408.12928) (2024)\n* [POINTS: Improving Your Vision-language Model with Affordable Strategies](https://huggingface.co/papers/2409.04828) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2024-10-12T01:34:19.032Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2410.04751","authors":[{"_id":"6705df47f6f1dfb1734d6ee4","user":{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","isPro":true,"fullname":"Young-Jun Lee","user":"passing2961","type":"user"},"name":"Young-Jun Lee","status":"claimed_verified","statusLastChangedAt":"2024-10-09T07:37:29.247Z","hidden":false},{"_id":"6705df47f6f1dfb1734d6ee5","user":{"_id":"60ff7a87aa025227d344b735","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1627359121375-60ff7a87aa025227d344b735.jpeg","isPro":false,"fullname":"Byungsoo Ko","user":"kobiso","type":"user"},"name":"Byungsoo Ko","status":"admin_assigned","statusLastChangedAt":"2024-10-11T10:34:44.659Z","hidden":false},{"_id":"6705df47f6f1dfb1734d6ee6","user":{"_id":"668644c2c22e4833a634f90f","avatarUrl":"/avatars/dec3f3e8737379567e6cc0d1d27d5ec8.svg","isPro":false,"fullname":"Han-Gyu Kim","user":"mkmiracle","type":"user"},"name":"Han-Gyu Kim","status":"extracted_pending","statusLastChangedAt":"2024-10-09T01:41:28.430Z","hidden":false},{"_id":"6705df47f6f1dfb1734d6ee7","user":{"_id":"653032b83ecbe51d6a6f0498","avatarUrl":"/avatars/3d435a420bf7024560ed5c5aa2e204d6.svg","isPro":false,"fullname":"Yechan Hwang","user":"YYXYmint","type":"user"},"name":"Yechan Hwang","status":"admin_assigned","statusLastChangedAt":"2024-10-14T07:54:02.769Z","hidden":false},{"_id":"6705df47f6f1dfb1734d6ee8","name":"Ho-Jin Choi","hidden":false}],"publishedAt":"2024-10-07T05:07:01.000Z","submittedOnDailyAt":"2024-10-11T03:01:45.531Z","title":"Intriguing Properties of Large Language and Vision Models","submittedOnDailyBy":{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","isPro":true,"fullname":"Young-Jun Lee","user":"passing2961","type":"user"},"summary":"Recently, large language and vision models (LLVMs) have received significant\nattention and development efforts due to their remarkable generalization\nperformance across a wide range of tasks requiring perception and cognitive\nabilities. A key factor behind their success is their simple architecture,\nwhich consists of a vision encoder, a projector, and a large language model\n(LLM). Despite their achievements in advanced reasoning tasks, their\nperformance on fundamental perception-related tasks (e.g., MMVP) remains\nsurprisingly low. This discrepancy raises the question of how LLVMs truly\nperceive images and exploit the advantages of the vision encoder. To address\nthis, we systematically investigate this question regarding several aspects:\npermutation invariance, robustness, math reasoning, alignment preserving and\nimportance, by evaluating the most common LLVM's families (i.e., LLaVA) across\n10 evaluation benchmarks. 
Our extensive experiments reveal several intriguing\nproperties of current LLVMs: (1) they internally process the image in a global\nmanner, even when the order of visual patch sequences is randomly permuted; (2)\nthey are sometimes able to solve math problems without fully perceiving\ndetailed numerical information; (3) the cross-modal alignment is overfitted to\ncomplex reasoning tasks, thereby, causing them to lose some of the original\nperceptual capabilities of their vision encoder; (4) the representation space\nin the lower layers (<25%) plays a crucial role in determining performance and\nenhancing visual understanding. Lastly, based on the above observations, we\nsuggest potential future directions for building better LLVMs and constructing\nmore challenging evaluation benchmarks.","upvotes":16,"discussionId":"6705df48f6f1dfb1734d6f14","ai_summary":"The investigation into large language and vision models reveals that while they excel in advanced reasoning, they may lack robust perception, as seen through various benchmarks evaluating different aspects like permutation invariance and cross-modal alignment.","ai_keywords":["large language and vision models","LLVMs","vision encoder","projector","large language model","MMVP","permutation invariance","math reasoning","cross-modal alignment","visual patch sequences","visual understanding"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","isPro":true,"fullname":"Young-Jun Lee","user":"passing2961","type":"user"},{"_id":"653032b83ecbe51d6a6f0498","avatarUrl":"/avatars/3d435a420bf7024560ed5c5aa2e204d6.svg","isPro":false,"fullname":"Yechan Hwang","user":"YYXYmint","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"668cd4bbe990292e5f6974d3","avatarUrl":"/avatars/d1747b2372e94500ecb5fb56809b482d.svg","isPro":false,"fullname":"Jinyeong Kim","user":"rubatoyeong","type":"user"},{"_id":"657152eb12f162153b50ec9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg","isPro":false,"fullname":"Byung-Kwan Lee","user":"BK-Lee","type":"user"},{"_id":"61e52be53d6dbb1da842316a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61e52be53d6dbb1da842316a/gx0WGPcOCClXPymoKglc4.jpeg","isPro":false,"fullname":"Börje Karlsson","user":"tellarin","type":"user"},{"_id":"6342796a0875f2c99cfd313b","avatarUrl":"/avatars/98575092404c4197b20c929a6499a015.svg","isPro":false,"fullname":"Yuseung \"Phillip\" Lee","user":"phillipinseoul","type":"user"},{"_id":"641aef7b1911d3be67425338","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641aef7b1911d3be67425338/CmCbWWB6NxkAaus59q31w.jpeg","isPro":false,"fullname":"Qi Liu (SJTU & SII)","user":"purewhite42","type":"user"},{"_id":"639a52003344c29aa04a1da2","avatarUrl":"/avatars/278aaa5176ad21756b920f2d841e4af1.svg","isPro":false,"fullname":"havaqa","user":"salmanhavaqa","type":"user"},{"_id":"64d90edd6db135cfc8f707c4","avatarUrl":"/avatars/5b18d128c927f4bf8ad0710ae77989a9.svg","isPro":false,"fullname":"Kyle 
Tuft","user":"Chilangosta","type":"user"},{"_id":"64587be872b60ae7a3817858","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png","isPro":false,"fullname":"Minbyul Jeong","user":"Minbyul","type":"user"},{"_id":"631c386bc73939ffc0716a37","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1662793811119-noauth.jpeg","isPro":false,"fullname":"SeongWan Kim","user":"idgmatrix","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2410.04751

Intriguing Properties of Large Language and Vision Models

Published on Oct 7, 2024
· Submitted by Young-Jun Lee on Oct 11, 2024
Authors: Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi

Abstract

AI-generated summary: The investigation into large language and vision models reveals that while they excel in advanced reasoning, they may lack robust perception, as shown by benchmarks probing aspects such as permutation invariance and cross-modal alignment.

Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question with respect to several aspects: permutation invariance, robustness, math reasoning, alignment preservation, and alignment importance, by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
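To make the permutation-invariance probe in point (1) concrete, here is a minimal sketch of how one might shuffle the projected visual patch sequence before it is handed to the LLM in a LLaVA-style pipeline (vision encoder, projector, LLM). The function and the surrounding pipeline names (`vision_encoder`, `projector`, `llm`) are illustrative assumptions for this page, not the authors' released code from https://github.com/passing2961/IP-LLVM.

```python
import torch

def permute_visual_patches(patch_embeds: torch.Tensor, seed: int | None = None) -> torch.Tensor:
    """Randomly permute the order of projected visual patch embeddings.

    patch_embeds: (batch, num_patches, hidden_dim) tensor produced by the
    projector in a LLaVA-style pipeline (vision encoder -> projector -> LLM).
    """
    if seed is not None:
        torch.manual_seed(seed)  # make the shuffle reproducible across runs
    num_patches = patch_embeds.size(1)
    perm = torch.randperm(num_patches, device=patch_embeds.device)
    # Apply the same permutation to every image in the batch.
    return patch_embeds[:, perm, :]

# Illustrative evaluation step (names are assumptions, not the paper's API):
# image_feats  = vision_encoder(pixel_values)            # (B, N, D_vision)
# patch_embeds = projector(image_feats)                  # (B, N, D_llm)
# shuffled     = permute_visual_patches(patch_embeds)    # same tokens, random order
# answer       = llm.generate(inputs_embeds=torch.cat([text_embeds, shuffled], dim=1))
```

Comparing benchmark accuracy on shuffled versus unshuffled inputs is the kind of check that supports the claim that LLVMs process images globally rather than relying on the spatial order of the patch sequence.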

Community

Paper author Paper submitter
edited Oct 11, 2024

@librarian-bot

Byung-Kwan Lee
Interesting work!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Phantom of Latent for Large Language and Vision Models (2024): https://huggingface.co/papers/2409.14713
* A Survey on Benchmarks of Multimodal Large Language Models (2024): https://huggingface.co/papers/2408.08632
* EMMA: Efficient Visual Alignment in Multi-Modal LLMs (2024): https://huggingface.co/papers/2410.02080
* ParGo: Bridging Vision-Language with Partial and Global Views (2024): https://huggingface.co/papers/2408.12928
* POINTS: Improving Your Vision-language Model with Affordable Strategies (2024): https://huggingface.co/papers/2409.04828

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.04751 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.04751 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.04751 in a Space README.md to link it from this page.

Collections including this paper 3