arxiv:2601.19798

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

Published on Jan 27 · Submitted by Daniel Bourke on Jan 28
Authors: Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li

Abstract

Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.

AI-generated summary

Youtu-VL addresses limitations in Vision-Language Models by introducing a unified autoregressive supervision paradigm that treats visual signals as target outputs rather than passive inputs, enabling improved multimodal comprehension and vision-centric task performance.
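The core mechanism here is a single next-token objective applied to an interleaved stream of visual and text tokens. As a rough illustration, the sketch below shows what such a unified loss could look like, assuming images are quantized into discrete tokens that share a prediction stream with the text; the function name, arguments, and loss weighting are hypothetical and not taken from the paper's implementation.

```python
# Illustrative sketch of "vision-as-target" unified autoregressive supervision.
# Assumption (not from the paper's code): images are quantized into discrete
# visual tokens that share one vocabulary/prediction stream with the text.
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(
    logits: torch.Tensor,     # (B, T, V) decoder outputs
    tokens: torch.Tensor,     # (B, T)   interleaved visual + text token ids
    is_visual: torch.Tensor,  # (B, T)   True at visual-token positions
    vision_weight: float = 1.0,
) -> torch.Tensor:
    """Next-token cross-entropy over BOTH visual and text positions."""
    # Standard language-model shift: logits at position t predict token t+1.
    pred = logits[:, :-1].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    vis = is_visual[:, 1:].reshape(-1)

    per_token = F.cross_entropy(pred, target, reduction="none")

    # Conventional "vision-as-input" training supervises only text positions.
    text_loss = per_token[~vis].mean()
    # The "vision-as-target" shift adds supervision on visual positions too.
    vision_loss = per_token[vis].mean()
    return text_loss + vision_weight * vision_loss
```

In conventional VLM fine-tuning, visual positions are typically masked out of the loss (e.g., by setting their labels to an ignore index), so the gradient carries no pressure to reconstruct visual detail; supervising those positions is what distinguishes the "vision-as-target" objective sketched here.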

Community

Paper submitter

Performs on par with Qwen3-VL-8B-Instruct on vision-based tasks despite being half the size.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Paper author

Youtu-VL demo video: https://cdn-uploads.huggingface.co/production/uploads/66091e552c198b9518772591/XMOeF9Oj6AvktO2uaNunR.qt


Models citing this paper 4

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.19798 in a dataset README.md to link it from this page.
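For example, simply including the link https://arxiv.org/abs/2601.19798 somewhere in the dataset card's README.md should be enough for the Hub to detect the arXiv ID and link the dataset from this page.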

Spaces citing this paper 2

Collections including this paper 4