NVILA: Efficient Frontier Visual Language Models
\n","updatedAt":"2024-12-07T01:35:02.802Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.681360125541687},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2412.04468","authors":[{"_id":"67530600886878c8868445a9","user":{"_id":"650dac79b959b0e1d41d7378","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650dac79b959b0e1d41d7378/mzbN0MFk3k8b94FQ40I7L.jpeg","isPro":false,"fullname":"Zhijian Liu","user":"zhijianliu","type":"user"},"name":"Zhijian Liu","status":"claimed_verified","statusLastChangedAt":"2025-05-30T06:59:20.813Z","hidden":false},{"_id":"67530600886878c8868445aa","user":{"_id":"62b4b5beb25cb80fcf278354","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62b4b5beb25cb80fcf278354/7uGNOeE91paZikhK2QFgp.jpeg","isPro":false,"fullname":"Ligeng Zhu","user":"Ligeng-Zhu","type":"user"},"name":"Ligeng Zhu","status":"claimed_verified","statusLastChangedAt":"2024-12-30T19:37:12.108Z","hidden":false},{"_id":"67530600886878c8868445ab","user":{"_id":"649004218f7cbbc94c782db6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/AdgLVfAIpWlug4jXTaEK-.jpeg","isPro":true,"fullname":"Baifeng Shi","user":"bfshi","type":"user"},"name":"Baifeng Shi","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:59.994Z","hidden":false},{"_id":"67530600886878c8868445ac","name":"Zhuoyang Zhang","hidden":false},{"_id":"67530600886878c8868445ad","name":"Yuming Lou","hidden":false},{"_id":"67530600886878c8868445ae","user":{"_id":"641d8bacd526196afc12766d","avatarUrl":"/avatars/73f7b2d86a7bf27940bec2b1f199d71b.svg","isPro":false,"fullname":"Shang Yang","user":"Shangy","type":"user"},"name":"Shang Yang","status":"claimed_verified","statusLastChangedAt":"2025-02-21T10:00:24.530Z","hidden":false},{"_id":"67530600886878c8868445af","user":{"_id":"66ce751a8ec9fda2cf5a9e85","avatarUrl":"/avatars/c17093ca81dad007b3e50bae503955a7.svg","isPro":false,"fullname":"Haocheng Xi","user":"xihc-ucb","type":"user"},"name":"Haocheng Xi","status":"claimed_verified","statusLastChangedAt":"2025-05-27T07:57:11.027Z","hidden":false},{"_id":"67530600886878c8868445b0","user":{"_id":"64ebbae6895a36ab28de811a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64ebbae6895a36ab28de811a/gBiaQP4paS4L13eu-yRm7.jpeg","isPro":false,"fullname":"Shiyi Cao","user":"eva98","type":"user"},"name":"Shiyi Cao","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:56.294Z","hidden":false},{"_id":"67530600886878c8868445b1","user":{"_id":"624ac662102fcdff87be51b9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/624ac662102fcdff87be51b9/rzNahZFFkp194170tactJ.jpeg","isPro":false,"fullname":"Yuxian Gu","user":"t1101675","type":"user"},"name":"Yuxian Gu","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:58.017Z","hidden":false},{"_id":"67530600886878c8868445b2","name":"Dacheng Li","hidden":false},{"_id":"67530600886878c8868445b3","name":"Xiuyu 
Li","hidden":false},{"_id":"67530600886878c8868445b4","name":"Yunhao Fang","hidden":false},{"_id":"67530600886878c8868445b5","user":{"_id":"62919485a29097b211bc7b83","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62919485a29097b211bc7b83/TX8iBGu5JSuFlrRvjPEBV.png","isPro":false,"fullname":"YukangChen","user":"Yukang","type":"user"},"name":"Yukang Chen","status":"claimed_verified","statusLastChangedAt":"2025-10-13T10:23:12.355Z","hidden":false},{"_id":"67530600886878c8868445b6","name":"Cheng-Yu Hsieh","hidden":false},{"_id":"67530600886878c8868445b7","name":"De-An Huang","hidden":false},{"_id":"67530600886878c8868445b8","name":"An-Chieh Cheng","hidden":false},{"_id":"67530600886878c8868445b9","name":"Vishwesh Nath","hidden":false},{"_id":"67530600886878c8868445ba","user":{"_id":"637a06580a77f602dc4ac922","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/637a06580a77f602dc4ac922/qOJAhHOEE2N-HzRcZOu1L.jpeg","isPro":false,"fullname":"Jinyi Hu","user":"JamesHujy","type":"user"},"name":"Jinyi Hu","status":"claimed_verified","statusLastChangedAt":"2024-12-09T11:06:54.373Z","hidden":false},{"_id":"67530600886878c8868445bb","name":"Sifei Liu","hidden":false},{"_id":"67530600886878c8868445bc","name":"Ranjay Krishna","hidden":false},{"_id":"67530600886878c8868445bd","name":"Daguang Xu","hidden":false},{"_id":"67530600886878c8868445be","name":"Xiaolong Wang","hidden":false},{"_id":"67530600886878c8868445bf","user":{"_id":"646d0c1c534e52f8c30500a6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646d0c1c534e52f8c30500a6/75VH8ClbRaP75BU2ONfXE.png","isPro":true,"fullname":"Pavlo Molchanov","user":"pmolchanov","type":"user"},"name":"Pavlo Molchanov","status":"claimed_verified","statusLastChangedAt":"2025-04-20T15:04:52.036Z","hidden":false},{"_id":"67530600886878c8868445c0","name":"Jan Kautz","hidden":false},{"_id":"67530600886878c8868445c1","user":{"_id":"65a8b7f69aec1645994e7a15","avatarUrl":"/avatars/debc086f3fea029db22847bde80799a0.svg","isPro":false,"fullname":"Hongxu Yin","user":"yinhongxu","type":"user"},"name":"Hongxu Yin","status":"claimed_verified","statusLastChangedAt":"2025-03-27T09:05:39.703Z","hidden":false},{"_id":"67530600886878c8868445c2","name":"Song Han","hidden":false},{"_id":"67530600886878c8868445c3","name":"Yao Lu","hidden":false}],"publishedAt":"2024-12-05T18:59:55.000Z","submittedOnDailyAt":"2024-12-06T18:28:53.023Z","title":"NVILA: Efficient Frontier Visual Language Models","submittedOnDailyBy":{"_id":"650dac79b959b0e1d41d7378","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650dac79b959b0e1d41d7378/mzbN0MFk3k8b94FQ40I7L.jpeg","isPro":false,"fullname":"Zhijian Liu","user":"zhijianliu","type":"user"},"summary":"Visual language models (VLMs) have made significant advances in accuracy in\nrecent years. However, their efficiency has received much less attention. This\npaper introduces NVILA, a family of open VLMs designed to optimize both\nefficiency and accuracy. Building on top of VILA, we improve its model\narchitecture by first scaling up the spatial and temporal resolutions, and then\ncompressing visual tokens. This \"scale-then-compress\" approach enables NVILA to\nefficiently process high-resolution images and long videos. We also conduct a\nsystematic investigation to enhance the efficiency of NVILA throughout its\nentire lifecycle, from training and fine-tuning to deployment. 
NVILA matches or\nsurpasses the accuracy of many leading open and proprietary VLMs across a wide\nrange of image and video benchmarks. At the same time, it reduces training\ncosts by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by\n1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and\nmodels available to facilitate reproducibility.","upvotes":60,"discussionId":"67530604886878c886844750","projectPage":"https://nvlabs.github.io/VILA/","githubRepo":"https://github.com/NVlabs/VILA","githubRepoAddedBy":"user","ai_summary":"NVILA, a family of VLMs, optimizes efficiency and accuracy through a scale-then-compress approach, enhancing performance across various benchmarks while reducing computational costs.","ai_keywords":["Visual language models","NVILA","VILA","scale-then-compress","spatial resolutions","temporal resolutions","visual tokens","system lifecycle","training","fine-tuning","deployment","training costs","fine-tuning memory usage","pre-filling latency","decoding latency"],"githubStars":3760},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"650dac79b959b0e1d41d7378","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650dac79b959b0e1d41d7378/mzbN0MFk3k8b94FQ40I7L.jpeg","isPro":false,"fullname":"Zhijian Liu","user":"zhijianliu","type":"user"},{"_id":"624ac662102fcdff87be51b9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/624ac662102fcdff87be51b9/rzNahZFFkp194170tactJ.jpeg","isPro":false,"fullname":"Yuxian Gu","user":"t1101675","type":"user"},{"_id":"649004218f7cbbc94c782db6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/AdgLVfAIpWlug4jXTaEK-.jpeg","isPro":true,"fullname":"Baifeng Shi","user":"bfshi","type":"user"},{"_id":"641d8bacd526196afc12766d","avatarUrl":"/avatars/73f7b2d86a7bf27940bec2b1f199d71b.svg","isPro":false,"fullname":"Shang Yang","user":"Shangy","type":"user"},{"_id":"646d0c1c534e52f8c30500a6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646d0c1c534e52f8c30500a6/75VH8ClbRaP75BU2ONfXE.png","isPro":true,"fullname":"Pavlo Molchanov","user":"pmolchanov","type":"user"},{"_id":"62fb8412610dae1bcd036db7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62fb8412610dae1bcd036db7/LIaEoDrvrq8hDl8509ZuF.jpeg","isPro":false,"fullname":"Qinghao Hu","user":"Qinghao","type":"user"},{"_id":"63715b25ffc0489ed7d1f415","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63715b25ffc0489ed7d1f415/xZJepbs0LRqFbW1knnBKR.jpeg","isPro":false,"fullname":"Dacheng Li","user":"DachengLi","type":"user"},{"_id":"6587e3a17959448ef5fc7735","avatarUrl":"/avatars/ab1c723226f1f932009b9a49546d53f2.svg","isPro":false,"fullname":"Jason (Yao Lu)","user":"klldmofashi","type":"user"},{"_id":"6452a2cbc895e6437dc567e2","avatarUrl":"/avatars/f488ebe2fbece0e8ef4e9d6c95898f3a.svg","isPro":false,"fullname":"Haocheng Xi","user":"Xihc20","type":"user"},{"_id":"64caca9b4ca74090b97a8795","avatarUrl":"/avatars/310f7703ff3c07d283c02aca56fe86df.svg","isPro":false,"fullname":"Saurav Muralidharan","user":"srvm","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"65a8b7f69aec1645994e7a15","avatarUrl":"/avatars/debc086f3fea029db22847bde80799a0.svg","isPro":false,"fullname":"Hongxu 
Yin","user":"yinhongxu","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
NVILA, a family of VLMs, optimizes efficiency and accuracy through a "scale-then-compress" approach, enhancing performance across various benchmarks while reducing computational costs.
Abstract
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
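To make the "scale-then-compress" idea more concrete, below is a minimal, hypothetical PyTorch sketch: the image is split into high-resolution tiles so the encoder keeps full spatial detail (scale), and each tile's token grid is then pooled so fewer visual tokens reach the language model (compress). The function name, the 448-pixel tile size, the 2x2 average pooling, and the generic encoder argument are illustrative assumptions, not NVILA's published architecture or hyperparameters.

```python
# Minimal sketch of a "scale-then-compress" pipeline for visual tokens.
# Illustrative only: tile size, pooling factor, and encoder are placeholders,
# not NVILA's actual design.
import torch
import torch.nn.functional as F

def scale_then_compress(image: torch.Tensor,
                        encoder,          # any ViT-style encoder: (B, C, H, W) -> (B, N, D)
                        tile: int = 448,  # assumed per-tile resolution (placeholder)
                        pool: int = 2):   # assumed token-compression factor (placeholder)
    """image: (C, H, W) high-resolution input with H and W multiples of `tile`."""
    C, H, W = image.shape

    # 1) Scale: keep full spatial resolution by splitting the image into tiles
    #    instead of downsampling the whole image to the encoder's input size.
    tiles = (image.unfold(1, tile, tile)         # (C, H//tile, W, tile)
                  .unfold(2, tile, tile)         # (C, H//tile, W//tile, tile, tile)
                  .permute(1, 2, 0, 3, 4)        # (nh, nw, C, tile, tile)
                  .reshape(-1, C, tile, tile))   # (num_tiles, C, tile, tile)
    tokens = encoder(tiles)                      # (num_tiles, N, D), N assumed a square number

    # 2) Compress: pool each tile's token grid so the LLM sees fewer visual tokens.
    n = int(tokens.shape[1] ** 0.5)
    grid = tokens.transpose(1, 2).reshape(tokens.shape[0], -1, n, n)  # (T, D, n, n)
    grid = F.avg_pool2d(grid, kernel_size=pool)                       # (T, D, n/pool, n/pool)
    compressed = grid.flatten(2).transpose(1, 2)                      # (T, N/pool^2, D)
    return compressed.reshape(1, -1, compressed.shape[-1])            # (1, total_tokens, D)
```

The abstract does not specify the exact compression operator; the average pooling here is only a stand-in to show how the visual token count shrinks before the tokens reach the language model, while the encoder still sees the image at full resolution.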