\n\n","updatedAt":"2026-01-30T03:32:00.633Z","author":{"_id":"66091e552c198b9518772591","avatarUrl":"/avatars/26e0c818bcd449b17f65463f7ee277f1.svg","fullname":"Zane","name":"zanekan01","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.30423110723495483},"editors":["zanekan01"],"editorAvatarUrls":["/avatars/26e0c818bcd449b17f65463f7ee277f1.svg"],"reactions":[{"reaction":"👍","users":["fujikoli","tedsun","kamamuta"],"count":3}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.19798","authors":[{"_id":"6979818fdf44b75fa47e477a","user":{"_id":"656312995475849b82c38bc4","avatarUrl":"/avatars/fd52f3a76a1ca475585d13c3f9e50c54.svg","isPro":true,"fullname":"zhixiangwei","user":"zhixiangwei","type":"user"},"name":"Zhixiang Wei","status":"claimed_verified","statusLastChangedAt":"2026-01-29T09:17:26.529Z","hidden":false},{"_id":"6979818fdf44b75fa47e477b","name":"Yi Li","hidden":false},{"_id":"6979818fdf44b75fa47e477c","user":{"_id":"66091e552c198b9518772591","avatarUrl":"/avatars/26e0c818bcd449b17f65463f7ee277f1.svg","isPro":false,"fullname":"Zane","user":"zanekan01","type":"user"},"name":"Zhehan Kan","status":"claimed_verified","statusLastChangedAt":"2026-01-30T09:37:34.532Z","hidden":false},{"_id":"6979818fdf44b75fa47e477d","name":"Xinghua Jiang","hidden":false},{"_id":"6979818fdf44b75fa47e477e","name":"Zuwei Long","hidden":false},{"_id":"6979818fdf44b75fa47e477f","user":{"_id":"674d5157ceb0c4f251d7f58d","avatarUrl":"/avatars/2c712d31932ceee42414628ea456a993.svg","isPro":false,"fullname":"ShifengLiu","user":"ShifengLiu","type":"user"},"name":"Shifeng Liu","status":"claimed_verified","statusLastChangedAt":"2026-01-30T09:37:36.596Z","hidden":false},{"_id":"6979818fdf44b75fa47e4780","name":"Hongze Shen","hidden":false},{"_id":"6979818fdf44b75fa47e4781","name":"Wei Liu","hidden":false},{"_id":"6979818fdf44b75fa47e4782","user":{"_id":"637af0a7bdf7309aa6da1c36","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/637af0a7bdf7309aa6da1c36/NHZ-09otVCfbpXVxm8f-e.png","isPro":false,"fullname":"Xiaoyu Tan","user":"WIlliam1900","type":"user"},"name":"Xiaoyu Tan","status":"claimed_verified","statusLastChangedAt":"2026-01-29T09:17:24.379Z","hidden":false},{"_id":"6979818fdf44b75fa47e4783","name":"Haojia Lin","hidden":false},{"_id":"6979818fdf44b75fa47e4784","user":{"_id":"698aaaf5599fceb2ec8bbd20","avatarUrl":"/avatars/7bb16fdf780aae64f631d4c77bb4605d.svg","isPro":false,"fullname":"Yubo Zhu","user":"yuboz","type":"user"},"name":"Yubo Zhu","status":"claimed_verified","statusLastChangedAt":"2026-02-10T09:08:18.967Z","hidden":false},{"_id":"6979818fdf44b75fa47e4785","name":"Qianyu Li","hidden":false},{"_id":"6979818fdf44b75fa47e4786","user":{"_id":"63fc75f9b9db84750cea9c5c","avatarUrl":"/avatars/2c5bf9685e0cfc4b5785a4a86c34e0db.svg","isPro":false,"fullname":"DI YIN","user":"DIYIN","type":"user"},"name":"Di Yin","status":"claimed_verified","statusLastChangedAt":"2026-01-28T11:16:25.761Z","hidden":false},{"_id":"6979818fdf44b75fa47e4787","user":{"_id":"641c1a85999935676ec7ead4","avatarUrl":"/avatars/738a4e43bb3ddd4de287acfa26095b3d.svg","isPro":false,"fullname":"caohaoyu","user":"rechy","type":"user"},"name":"Haoyu Cao","status":"claimed_verified","statusLastChangedAt":"2026-01-30T09:37:30.988Z","hidden":false},{"_id":"6979818fdf44b75fa47e4788","name":"Weibo 
Gu","hidden":false},{"_id":"6979818fdf44b75fa47e4789","user":{"_id":"667016b9ccff4d0862460d88","avatarUrl":"/avatars/96959ed963af9eae2814079c5241af1c.svg","isPro":false,"fullname":"Xin Li","user":"fujikoli","type":"user"},"name":"Xin Li","status":"claimed_verified","statusLastChangedAt":"2026-01-28T11:28:49.228Z","hidden":false},{"_id":"6979818fdf44b75fa47e478a","user":{"_id":"6959daf62da262f6caa6b3b9","avatarUrl":"/avatars/3986907912cd2d502570b8969b02abcc.svg","isPro":false,"fullname":"Yinsong Liu","user":"Yinsongliu","type":"user"},"name":"Yinsong Liu","status":"claimed_verified","statusLastChangedAt":"2026-01-28T11:16:33.643Z","hidden":false},{"_id":"6979818fdf44b75fa47e478b","name":"Deqiang Jiang","hidden":false},{"_id":"6979818fdf44b75fa47e478c","user":{"_id":"647401e50da364bd0d002f2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/vPuPn7EV092mLBOM2YZXd.png","isPro":false,"fullname":"XING SUN","user":"tedsun","type":"user"},"name":"Xing Sun","status":"claimed_verified","statusLastChangedAt":"2026-01-28T11:16:28.308Z","hidden":false},{"_id":"6979818fdf44b75fa47e478d","name":"Yunsheng Wu","hidden":false},{"_id":"6979818fdf44b75fa47e478e","name":"Mingkong Tang","hidden":false},{"_id":"6979818fdf44b75fa47e478f","name":"Shuangyin Liu","hidden":false},{"_id":"6979818fdf44b75fa47e4790","name":"Lexiang Tang","hidden":false},{"_id":"6979818fdf44b75fa47e4791","user":{"_id":"68006d7f9c4da89b98b95d08","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/68006d7f9c4da89b98b95d08/3oFLjqEQ7kN4z4Hmura8C.jpeg","isPro":false,"fullname":"HaoDong Lin","user":"linhaodong19","type":"user"},"name":"Haodong Lin","status":"claimed_verified","statusLastChangedAt":"2026-01-30T09:37:28.945Z","hidden":false},{"_id":"6979818fdf44b75fa47e4792","user":{"_id":"64ddbc3f1f2dad27e1a05ac1","avatarUrl":"/avatars/4fb2753b7998c8536bfd4780d3b10a6d.svg","isPro":false,"fullname":"Junrulu","user":"Junrulu","type":"user"},"name":"Junru Lu","status":"claimed_verified","statusLastChangedAt":"2026-01-28T11:16:30.779Z","hidden":false},{"_id":"6979818fdf44b75fa47e4793","user":{"_id":"64181d03edc5a69a66959b8a","avatarUrl":"/avatars/ca1748bc8fe0742158d836302c4292c7.svg","isPro":false,"fullname":"JR QIN","user":"qinjr","type":"user"},"name":"Jiarui Qin","status":"claimed_verified","statusLastChangedAt":"2026-01-29T09:17:35.396Z","hidden":false},{"_id":"6979818fdf44b75fa47e4794","name":"Lingfeng Qiao","hidden":false},{"_id":"6979818fdf44b75fa47e4795","name":"Ruizhi Qiao","hidden":false},{"_id":"6979818fdf44b75fa47e4796","name":"Bo Ke","hidden":false},{"_id":"6979818fdf44b75fa47e4797","name":"Jianfeng He","hidden":false},{"_id":"6979818fdf44b75fa47e4798","name":"Ke Li","hidden":false},{"_id":"6979818fdf44b75fa47e4799","name":"Yangning Li","hidden":false},{"_id":"6979818fdf44b75fa47e479a","name":"Yunhang Shen","hidden":false},{"_id":"6979818fdf44b75fa47e479b","name":"Mengdan Zhang","hidden":false},{"_id":"6979818fdf44b75fa47e479c","name":"Peixian Chen","hidden":false},{"_id":"6979818fdf44b75fa47e479d","name":"Kun Yin","hidden":false},{"_id":"6979818fdf44b75fa47e479e","name":"Bing Liu","hidden":false},{"_id":"6979818fdf44b75fa47e479f","name":"Yunfei Wu","hidden":false},{"_id":"6979818fdf44b75fa47e47a0","user":{"_id":"641c05d80742e5d0e21782d2","avatarUrl":"/avatars/40e0b0d22bd1992cfba7569a2f655ed0.svg","isPro":false,"fullname":"Huang Chen","user":"double22a","type":"user"},"name":"Huang 
Chen","status":"claimed_verified","statusLastChangedAt":"2026-01-30T09:37:26.965Z","hidden":false},{"_id":"6979818fdf44b75fa47e47a1","name":"Zhongpeng Cai","hidden":false},{"_id":"6979818fdf44b75fa47e47a2","name":"Xiaotian Li","hidden":false}],"publishedAt":"2026-01-27T17:01:16.000Z","submittedOnDailyAt":"2026-01-28T19:57:05.016Z","title":"Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision","submittedOnDailyBy":{"_id":"610a70f35a40a8bfebfbf09b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1659922312540-610a70f35a40a8bfebfbf09b.jpeg","isPro":true,"fullname":"Daniel Bourke","user":"mrdbourke","type":"user"},"summary":"Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from ``vision-as-input'' to ``vision-as-target.'' By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.","upvotes":42,"discussionId":"69798190df44b75fa47e47a3","projectPage":"https://youtu-tip.com/#llm","githubRepo":"https://github.com/TencentCloudADP/youtu-vl","githubRepoAddedBy":"user","ai_summary":"Youtu-VL addresses limitations in Vision-Language Models by introducing a unified autoregressive supervision paradigm that treats visual signals as target outputs rather than passive inputs, enabling improved multimodal comprehension and vision-centric task performance.","ai_keywords":["Vision-Language Models","autoregressive supervision","vision-as-target","vision-as-input","visual tokens","multimodal comprehension","vision-centric tasks","generalist visual agents"],"githubStars":133,"organization":{"_id":"66543b6e420092799d2f625c","name":"tencent","fullname":"Tencent","avatar":"https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/Lp3m-XLpjQGwBItlvn69q.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6959daf62da262f6caa6b3b9","avatarUrl":"/avatars/3986907912cd2d502570b8969b02abcc.svg","isPro":false,"fullname":"Yinsong 
Liu","user":"Yinsongliu","type":"user"},{"_id":"64ddbc3f1f2dad27e1a05ac1","avatarUrl":"/avatars/4fb2753b7998c8536bfd4780d3b10a6d.svg","isPro":false,"fullname":"Junrulu","user":"Junrulu","type":"user"},{"_id":"66091e552c198b9518772591","avatarUrl":"/avatars/26e0c818bcd449b17f65463f7ee277f1.svg","isPro":false,"fullname":"Zane","user":"zanekan01","type":"user"},{"_id":"674d5157ceb0c4f251d7f58d","avatarUrl":"/avatars/2c712d31932ceee42414628ea456a993.svg","isPro":false,"fullname":"ShifengLiu","user":"ShifengLiu","type":"user"},{"_id":"662b1e9aef7a4675bdff8bd0","avatarUrl":"/avatars/afc0628998316a9226f27976ca8c8376.svg","isPro":false,"fullname":"Yi Li","user":"yili7eli","type":"user"},{"_id":"69326ac4edf0cd8d0530adcb","avatarUrl":"/avatars/40e990a148e75053fce9c06cfe015b18.svg","isPro":false,"fullname":"Yubo Zhu","user":"efdsf324","type":"user"},{"_id":"6372813520a58a5e14c596a3","avatarUrl":"/avatars/9135151259db3e5b9c8969e1d00c949d.svg","isPro":false,"fullname":"XuHao Hu","user":"Foreshhh","type":"user"},{"_id":"647401e50da364bd0d002f2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/vPuPn7EV092mLBOM2YZXd.png","isPro":false,"fullname":"XING SUN","user":"tedsun","type":"user"},{"_id":"63fc75f9b9db84750cea9c5c","avatarUrl":"/avatars/2c5bf9685e0cfc4b5785a4a86c34e0db.svg","isPro":false,"fullname":"DI YIN","user":"DIYIN","type":"user"},{"_id":"63e9df3746574e63a2cc55c5","avatarUrl":"/avatars/3d27ad1ccfa51387e4b97d02e13deb41.svg","isPro":false,"fullname":"Lingfeng Qiao","user":"leafqiaoqiao","type":"user"},{"_id":"64181d03edc5a69a66959b8a","avatarUrl":"/avatars/ca1748bc8fe0742158d836302c4292c7.svg","isPro":false,"fullname":"JR QIN","user":"qinjr","type":"user"},{"_id":"64b02ec0e5000ae8a572ced5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64b02ec0e5000ae8a572ced5/6ifLntBU2ICQK7SW8WxKU.png","isPro":false,"fullname":"Lin Chen","user":"Lin-Chen","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"66543b6e420092799d2f625c","name":"tencent","fullname":"Tencent","avatar":"https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/Lp3m-XLpjQGwBItlvn69q.png"}}">Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Abstract
Youtu-VL addresses limitations in Vision-Language Models by introducing a unified autoregressive supervision paradigm that treats visual signals as target outputs rather than passive inputs, enabling improved multimodal comprehension and vision-centric task performance.
Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
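To make the "vision-as-target" shift concrete, here is a minimal sketch of a unified autoregressive loss over an interleaved vision-and-text target sequence. This is not the paper's actual implementation; it assumes visual content has been discretized into the same token id space as text, and all tensor names, the `token_types` encoding, and the `-100` ignore convention are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of unified autoregressive supervision:
# visual positions stay in the prediction stream and receive the same next-token
# cross-entropy as text, instead of being masked out as in "vision-as-input" training.
import torch
import torch.nn.functional as F


def unified_autoregressive_loss(logits, target_ids, token_types, supervise_vision=True):
    """
    logits:      (B, T, V) next-token predictions from the VLM.
    target_ids:  (B, T)    ground-truth ids; visual tokens are assumed to be
                           discretized into the same vocabulary as text.
    token_types: (B, T)    0 = text position, 1 = visual position.
    """
    # Standard causal shift: the prediction at position t is scored against token t+1.
    logits = logits[:, :-1, :]
    targets = target_ids[:, 1:].clone()
    types = token_types[:, 1:]

    if not supervise_vision:
        # Conventional "vision-as-input" setup: visual positions contribute no loss.
        targets[types == 1] = -100

    # With supervise_vision=True, visual and text positions share one objective,
    # which is the unified supervision the VLUAS paradigm argues for.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

Under this view, the only change relative to standard VLM training is which positions are allowed into the loss, which is why the paper can extend the same objective to vision-centric tasks without task-specific heads.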
Community
Performs on par with Qwen3-VL-8B-Instruct on vision-based tasks despite being half the size.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models (2025)
- CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks (2025)
- LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction (2026)
- Revisiting Multi-Task Visual Representation Learning (2026)
- VACoT: Rethinking Visual Data Augmentation with VLMs (2025)
- The Spatial Blindspot of Vision-Language Models (2026)
- Forest Before Trees: Latent Superposition for Efficient Visual Reasoning (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Youtu-VL demo video
Models citing this paper 4
Datasets citing this paper 0
No dataset linking this paper