NVILA: Efficient Frontier Visual Language Models
\n","updatedAt":"2024-12-07T01:35:02.802Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.681360125541687},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2412.04468","authors":[{"_id":"67530600886878c8868445a9","user":{"_id":"650dac79b959b0e1d41d7378","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650dac79b959b0e1d41d7378/mzbN0MFk3k8b94FQ40I7L.jpeg","isPro":false,"fullname":"Zhijian Liu","user":"zhijianliu","type":"user"},"name":"Zhijian Liu","status":"claimed_verified","statusLastChangedAt":"2025-05-30T06:59:20.813Z","hidden":false},{"_id":"67530600886878c8868445aa","user":{"_id":"62b4b5beb25cb80fcf278354","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62b4b5beb25cb80fcf278354/7uGNOeE91paZikhK2QFgp.jpeg","isPro":false,"fullname":"Ligeng Zhu","user":"Ligeng-Zhu","type":"user"},"name":"Ligeng Zhu","status":"claimed_verified","statusLastChangedAt":"2024-12-30T19:37:12.108Z","hidden":false},{"_id":"67530600886878c8868445ab","user":{"_id":"649004218f7cbbc94c782db6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/AdgLVfAIpWlug4jXTaEK-.jpeg","isPro":true,"fullname":"Baifeng Shi","user":"bfshi","type":"user"},"name":"Baifeng Shi","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:59.994Z","hidden":false},{"_id":"67530600886878c8868445ac","name":"Zhuoyang Zhang","hidden":false},{"_id":"67530600886878c8868445ad","name":"Yuming Lou","hidden":false},{"_id":"67530600886878c8868445ae","user":{"_id":"641d8bacd526196afc12766d","avatarUrl":"/avatars/73f7b2d86a7bf27940bec2b1f199d71b.svg","isPro":false,"fullname":"Shang Yang","user":"Shangy","type":"user"},"name":"Shang Yang","status":"claimed_verified","statusLastChangedAt":"2025-02-21T10:00:24.530Z","hidden":false},{"_id":"67530600886878c8868445af","user":{"_id":"66ce751a8ec9fda2cf5a9e85","avatarUrl":"/avatars/c17093ca81dad007b3e50bae503955a7.svg","isPro":false,"fullname":"Haocheng Xi","user":"xihc-ucb","type":"user"},"name":"Haocheng Xi","status":"claimed_verified","statusLastChangedAt":"2025-05-27T07:57:11.027Z","hidden":false},{"_id":"67530600886878c8868445b0","user":{"_id":"64ebbae6895a36ab28de811a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64ebbae6895a36ab28de811a/gBiaQP4paS4L13eu-yRm7.jpeg","isPro":false,"fullname":"Shiyi Cao","user":"eva98","type":"user"},"name":"Shiyi Cao","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:56.294Z","hidden":false},{"_id":"67530600886878c8868445b1","user":{"_id":"624ac662102fcdff87be51b9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/624ac662102fcdff87be51b9/rzNahZFFkp194170tactJ.jpeg","isPro":false,"fullname":"Yuxian Gu","user":"t1101675","type":"user"},"name":"Yuxian Gu","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:58.017Z","hidden":false},{"_id":"67530600886878c8868445b2","name":"Dacheng Li","hidden":false},{"_id":"67530600886878c8868445b3","name":"Xiuyu 
Li","hidden":false},{"_id":"67530600886878c8868445b4","name":"Yunhao Fang","hidden":false},{"_id":"67530600886878c8868445b5","user":{"_id":"62919485a29097b211bc7b83","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62919485a29097b211bc7b83/TX8iBGu5JSuFlrRvjPEBV.png","isPro":false,"fullname":"YukangChen","user":"Yukang","type":"user"},"name":"Yukang Chen","status":"claimed_verified","statusLastChangedAt":"2025-10-13T10:23:12.355Z","hidden":false},{"_id":"67530600886878c8868445b6","name":"Cheng-Yu Hsieh","hidden":false},{"_id":"67530600886878c8868445b7","name":"De-An Huang","hidden":false},{"_id":"67530600886878c8868445b8","name":"An-Chieh Cheng","hidden":false},{"_id":"67530600886878c8868445b9","name":"Vishwesh Nath","hidden":false},{"_id":"67530600886878c8868445ba","user":{"_id":"637a06580a77f602dc4ac922","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/637a06580a77f602dc4ac922/qOJAhHOEE2N-HzRcZOu1L.jpeg","isPro":false,"fullname":"Jinyi Hu","user":"JamesHujy","type":"user"},"name":"Jinyi Hu","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:54.373Z","hidden":false},{"_id":"67530600886878c8868445bb","name":"Sifei Liu","hidden":false},{"_id":"67530600886878c8868445bc","name":"Ranjay Krishna","hidden":false},{"_id":"67530600886878c8868445bd","name":"Daguang Xu","hidden":false},{"_id":"67530600886878c8868445be","name":"Xiaolong Wang","hidden":false},{"_id":"67530600886878c8868445bf","user":{"_id":"646d0c1c534e52f8c30500a6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646d0c1c534e52f8c30500a6/75VH8ClbRaP75BU2ONfXE.png","isPro":true,"fullname":"Pavlo Molchanov","user":"pmolchanov","type":"user"},"name":"Pavlo Molchanov","status":"claimed_verified","statusLastChangedAt":"2025-04-20T15:04:52.036Z","hidden":false},{"_id":"67530600886878c8868445c0","name":"Jan Kautz","hidden":false},{"_id":"67530600886878c8868445c1","user":{"_id":"65a8b7f69aec1645994e7a15","avatarUrl":"/avatars/debc086f3fea029db22847bde80799a0.svg","isPro":false,"fullname":"Hongxu Yin","user":"yinhongxu","type":"user"},"name":"Hongxu Yin","status":"claimed_verified","statusLastChangedAt":"2025-03-27T09:05:39.703Z","hidden":false},{"_id":"67530600886878c8868445c2","name":"Song Han","hidden":false},{"_id":"67530600886878c8868445c3","name":"Yao Lu","hidden":false}],"publishedAt":"2024-12-05T18:59:55.000Z","submittedOnDailyAt":"2024-12-06T18:28:53.023Z","title":"NVILA: Efficient Frontier Visual Language Models","submittedOnDailyBy":{"_id":"650dac79b959b0e1d41d7378","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650dac79b959b0e1d41d7378/mzbN0MFk3k8b94FQ40I7L.jpeg","isPro":false,"fullname":"Zhijian Liu","user":"zhijianliu","type":"user"},"summary":"Visual language models (VLMs) have made significant advances in accuracy in\nrecent years. However, their efficiency has received much less attention. This\npaper introduces NVILA, a family of open VLMs designed to optimize both\nefficiency and accuracy. Building on top of VILA, we improve its model\narchitecture by first scaling up the spatial and temporal resolutions, and then\ncompressing visual tokens. This \"scale-then-compress\" approach enables NVILA to\nefficiently process high-resolution images and long videos. We also conduct a\nsystematic investigation to enhance the efficiency of NVILA throughout its\nentire lifecycle, from training and fine-tuning to deployment. 
NVILA matches or\nsurpasses the accuracy of many leading open and proprietary VLMs across a wide\nrange of image and video benchmarks. At the same time, it reduces training\ncosts by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by\n1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and\nmodels available to facilitate reproducibility.","upvotes":60,"discussionId":"67530604886878c886844750","projectPage":"https://nvlabs.github.io/VILA/","githubRepo":"https://github.com/NVlabs/VILA","githubRepoAddedBy":"user","ai_summary":"NVILA, a family of VLMs, optimizes efficiency and accuracy through a scale-then-compress approach, enhancing performance across various benchmarks while reducing computational costs.","ai_keywords":["Visual language models","NVILA","VILA","scale-then-compress","spatial resolutions","temporal resolutions","visual tokens","system lifecycle","training","fine-tuning","deployment","training costs","fine-tuning memory usage","pre-filling latency","decoding latency"],"githubStars":3760},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"650dac79b959b0e1d41d7378","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650dac79b959b0e1d41d7378/mzbN0MFk3k8b94FQ40I7L.jpeg","isPro":false,"fullname":"Zhijian Liu","user":"zhijianliu","type":"user"},{"_id":"624ac662102fcdff87be51b9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/624ac662102fcdff87be51b9/rzNahZFFkp194170tactJ.jpeg","isPro":false,"fullname":"Yuxian Gu","user":"t1101675","type":"user"},{"_id":"649004218f7cbbc94c782db6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/AdgLVfAIpWlug4jXTaEK-.jpeg","isPro":true,"fullname":"Baifeng Shi","user":"bfshi","type":"user"},{"_id":"641d8bacd526196afc12766d","avatarUrl":"/avatars/73f7b2d86a7bf27940bec2b1f199d71b.svg","isPro":false,"fullname":"Shang Yang","user":"Shangy","type":"user"},{"_id":"646d0c1c534e52f8c30500a6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646d0c1c534e52f8c30500a6/75VH8ClbRaP75BU2ONfXE.png","isPro":true,"fullname":"Pavlo Molchanov","user":"pmolchanov","type":"user"},{"_id":"62fb8412610dae1bcd036db7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62fb8412610dae1bcd036db7/LIaEoDrvrq8hDl8509ZuF.jpeg","isPro":false,"fullname":"Qinghao Hu","user":"Qinghao","type":"user"},{"_id":"63715b25ffc0489ed7d1f415","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63715b25ffc0489ed7d1f415/xZJepbs0LRqFbW1knnBKR.jpeg","isPro":false,"fullname":"Dacheng Li","user":"DachengLi","type":"user"},{"_id":"6587e3a17959448ef5fc7735","avatarUrl":"/avatars/ab1c723226f1f932009b9a49546d53f2.svg","isPro":false,"fullname":"Jason (Yao Lu)","user":"klldmofashi","type":"user"},{"_id":"6452a2cbc895e6437dc567e2","avatarUrl":"/avatars/f488ebe2fbece0e8ef4e9d6c95898f3a.svg","isPro":false,"fullname":"Haocheng Xi","user":"Xihc20","type":"user"},{"_id":"64caca9b4ca74090b97a8795","avatarUrl":"/avatars/310f7703ff3c07d283c02aca56fe86df.svg","isPro":false,"fullname":"Saurav Muralidharan","user":"srvm","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"65a8b7f69aec1645994e7a15","avatarUrl":"/avatars/debc086f3fea029db22847bde80799a0.svg","isPro":false,"fullname":"Hongxu 
Yin","user":"yinhongxu","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
NVILA, a family of VLMs, optimizes efficiency and accuracy through a "scale-then-compress" approach, enhancing performance across various benchmarks while reducing computational costs.
Abstract
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
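To make the "scale-then-compress" idea more concrete, below is a minimal, hypothetical PyTorch sketch: the image is split into high-resolution tiles so the encoder keeps full spatial detail (scale), and each tile's token grid is then pooled so fewer visual tokens reach the language model (compress). The function name, the 448-pixel tile size, the 2x2 average pooling, and the generic encoder argument are illustrative assumptions, not NVILA's published architecture or hyperparameters.

```python
# Minimal sketch of a "scale-then-compress" pipeline for visual tokens.
# Illustrative only: tile size, pooling factor, and encoder are placeholders,
# not NVILA's actual design.
import torch
import torch.nn.functional as F

def scale_then_compress(image: torch.Tensor,
                        encoder,          # any ViT-style encoder: (B, C, H, W) -> (B, N, D)
                        tile: int = 448,  # assumed per-tile resolution (placeholder)
                        pool: int = 2):   # assumed token-compression factor (placeholder)
    """image: (C, H, W) high-resolution input with H and W multiples of `tile`."""
    C, H, W = image.shape

    # 1) Scale: keep full spatial resolution by splitting the image into tiles
    #    instead of downsampling the whole image to the encoder's input size.
    tiles = (image.unfold(1, tile, tile)         # (C, H//tile, W, tile)
                  .unfold(2, tile, tile)         # (C, H//tile, W//tile, tile, tile)
                  .permute(1, 2, 0, 3, 4)        # (nh, nw, C, tile, tile)
                  .reshape(-1, C, tile, tile))   # (num_tiles, C, tile, tile)
    tokens = encoder(tiles)                      # (num_tiles, N, D), N assumed a square number

    # 2) Compress: pool each tile's token grid so the LLM sees fewer visual tokens.
    n = int(tokens.shape[1] ** 0.5)
    grid = tokens.transpose(1, 2).reshape(tokens.shape[0], -1, n, n)  # (T, D, n, n)
    grid = F.avg_pool2d(grid, kernel_size=pool)                       # (T, D, n/pool, n/pool)
    compressed = grid.flatten(2).transpose(1, 2)                      # (T, N/pool^2, D)
    return compressed.reshape(1, -1, compressed.shape[-1])            # (1, total_tokens, D)
```

The abstract does not specify the exact compression operator; the average pooling here is only a stand-in to show how the visual token count shrinks before the tokens reach the language model, while the encoder still sees the image at full resolution.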