Papers
arxiv:2409.12191

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Published on Sep 18, 2024 · Submitted by AK on Sep 19, 2024
Authors: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin

Abstract

AI-generated summary

The Qwen2-VL Series uses Naive Dynamic Resolution and Multimodal Rotary Position Embedding to enhance visual processing and achieves competitive performance on multimodal benchmarks.

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
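To make the Naive Dynamic Resolution idea concrete, the sketch below shows how a variable-resolution image can be mapped to a variable number of visual tokens instead of a fixed count. The 14-pixel patch size, the 2x2 patch-merging factor, and the token budget are illustrative assumptions and may not match the released models exactly.

```python
import math

def visual_token_count(height, width, patch_size=14, merge_factor=2,
                       min_tokens=4, max_tokens=16384):
    """Estimate how many visual tokens an image of the given size produces.

    Assumes the vision encoder cuts the image into patch_size x patch_size
    patches and then merges merge_factor x merge_factor neighbouring patches
    into a single token. All constants are illustrative, not the exact
    Qwen2-VL configuration.
    """
    unit = patch_size * merge_factor          # pixels covered by one final token
    h_units = max(1, math.ceil(height / unit))
    w_units = max(1, math.ceil(width / unit))
    tokens = h_units * w_units
    # A real pipeline would rescale the image to stay within a token budget;
    # here we simply clamp the count for illustration.
    return max(min_tokens, min(tokens, max_tokens))

print(visual_token_count(224, 224))     # small image -> 64 tokens
print(visual_token_count(1080, 1920))   # HD frame -> a few thousand tokens
```

Multimodal Rotary Position Embedding can be pictured as splitting each attention head's rotary channels into three groups that are rotated by a token's temporal, height, and width indices; for plain text all three indices coincide with the ordinary 1-D position. The group sizes and frequency base below are assumptions made for illustration, not the paper's exact layout.

```python
import numpy as np

def mrope_angles(t, h, w, dim=64, base=10000.0):
    """Rotation angles for a token at multimodal position (t, h, w).

    The dim rotary channels are split into three groups driven by the
    temporal, height, and width coordinates respectively. Group sizes and
    the frequency base are illustrative assumptions.
    """
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)   # standard RoPE frequencies
    groups = np.array_split(np.arange(half), 3)    # ~1/3 of channels per coordinate
    angles = np.empty(half)
    for coord, idx in zip((t, h, w), groups):
        angles[idx] = coord * inv_freq[idx]
    return angles

def apply_rope(x, angles):
    """Rotate consecutive channel pairs (x[2i], x[2i+1]) by the given angles."""
    x1, x2 = x[..., ::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(64)
q_text  = apply_rope(q, mrope_angles(5, 5, 5))   # text token at 1-D position 5
q_patch = apply_rope(q, mrope_angles(0, 3, 7))   # image patch at row 3, col 7 of frame 0
```

For readers who want to try the released checkpoints, the following is a minimal inference sketch along the lines of the repository's README. The model id, the qwen-vl-utils helper package, and the exact processor arguments are taken from that README and may differ across library versions; the image URL is a placeholder.

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info   # pip install qwen-vl-utils

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/demo.jpg"},  # any local path or URL
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the prompt and the vision inputs, then run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```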

Community

Paper submitter

https://github.com/QwenLM/Qwen2-VL
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations (https://huggingface.co/papers/2409.03206) (2024)
* VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (https://huggingface.co/papers/2409.04429) (2024)
* Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling (https://huggingface.co/papers/2409.05395) (2024)
* mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (https://huggingface.co/papers/2408.04840) (2024)
* LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models (https://huggingface.co/papers/2408.16224) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

[Attached image: only_claims.png]

Give me all claims related to the above product.



Models citing this paper 411


Datasets citing this paper 0

No datasets link this paper.

Cite arxiv.org/abs/2409.12191 in a dataset README.md to link it from this page.

Spaces citing this paper 1,897

Collections including this paper 28