arxiv:2412.04468 · Project page: https://nvlabs.github.io/VILA/ · GitHub: https://github.com/NVlabs/VILA

NVILA: Efficient Frontier Visual Language Models

Published on Dec 5, 2024
Submitted by Zhijian Liu on Dec 6, 2024
Authors: Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu

Abstract

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.

AI-generated summary

NVILA, a family of VLMs, optimizes efficiency and accuracy through a scale-then-compress approach, enhancing performance across various benchmarks while reducing computational costs.
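As a rough illustration of the "scale-then-compress" idea described in the abstract, the sketch below tiles a high-resolution image into sub-images (scaling up the effective spatial resolution), encodes each tile into a grid of visual tokens, and then spatially pools those tokens before they would be handed to the LLM. This is a minimal sketch, not the NVILA implementation: the helper names (tile_image, compress_tokens), the 448-pixel tile size, the random stand-in for the vision encoder, and the 2x2 average-pooling factor are all illustrative assumptions.

```python
# Hedged sketch of a "scale-then-compress" visual token pipeline.
# NOT the NVILA reference implementation; tile size, encoder stand-in,
# and the 2x2 pooling factor are illustrative assumptions only.
import torch
import torch.nn.functional as F


def tile_image(image: torch.Tensor, tile: int = 448) -> torch.Tensor:
    """Split a high-resolution image (C, H, W) into non-overlapping tiles.

    Scaling up the input resolution this way multiplies the number of
    visual tokens, which is what the compression step then reduces.
    """
    c, h, w = image.shape
    assert h % tile == 0 and w % tile == 0, "pad/resize to a tile multiple first"
    tiles = image.unfold(1, tile, tile).unfold(2, tile, tile)       # (C, nH, nW, tile, tile)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)  # (nTiles, C, tile, tile)


def compress_tokens(tokens: torch.Tensor, grid: int, factor: int = 2) -> torch.Tensor:
    """Spatially pool a (N, grid*grid, D) token sequence by `factor` per side,
    cutting the token count by factor**2 before it reaches the LLM."""
    n, _, d = tokens.shape
    x = tokens.transpose(1, 2).reshape(n, d, grid, grid)  # (N, D, grid, grid)
    x = F.avg_pool2d(x, kernel_size=factor)               # (N, D, grid/f, grid/f)
    return x.flatten(2).transpose(1, 2)                   # (N, (grid/f)**2, D)


if __name__ == "__main__":
    image = torch.randn(3, 896, 896)         # "scaled-up" input resolution
    tiles = tile_image(image, tile=448)      # 4 tiles instead of 1
    # Stand-in for a ViT encoder: a 32x32 patch grid of 1024-dim tokens per tile.
    tokens = torch.randn(tiles.shape[0], 32 * 32, 1024)
    compressed = compress_tokens(tokens, grid=32, factor=2)
    print(tokens.shape[0] * tokens.shape[1], "->", compressed.shape[0] * compressed.shape[1])
```

With these assumed numbers, a 896x896 input yields four tiles and 4,096 visual tokens, and 2x2 pooling brings that back down to 1,024 tokens: the token budget is recovered after the resolution scale-up, which is the essence of scaling first and compressing afterwards.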

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 28


Datasets citing this paper 1

Spaces citing this paper 3

Collections including this paper 10