Paper page - Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

arXiv: 2507.08441
Authors: Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
AI-generated summary

A novel image tokenizer built on pre-trained vision foundation models improves image reconstruction, generation quality, and token efficiency, enhancing autoregressive generation and class-conditional synthesis.
Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity.
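To make the two components concrete, here is a minimal PyTorch sketch of the overall design: a frozen foundation-model encoder, a vector quantizer (plain nearest-neighbour VQ standing in for the region-adaptive scheme, whose irregular-region grouping is omitted), and a semantic reconstruction loss against the frozen features. All module names, dimensions, and the encoder's output shape are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VFMTokenizerSketch(nn.Module):
    """Hypothetical sketch of a VFMTok-style tokenizer, not the authors' code."""

    def __init__(self, vfm_encoder, codebook_size=8192, code_dim=16, feat_dim=768):
        super().__init__()
        self.encoder = vfm_encoder          # pre-trained vision foundation model
        for p in self.encoder.parameters():
            p.requires_grad_(False)         # kept frozen, as described above
        self.proj_in = nn.Linear(feat_dim, code_dim)
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.semantic_head = nn.Linear(code_dim, feat_dim)  # predicts VFM features back

    def quantize(self, z):
        # Plain nearest-neighbour VQ; the paper's region-adaptive scheme would
        # additionally merge redundant 2D-grid positions into adaptive regions.
        flat = z.reshape(-1, z.shape[-1])                # (B*L, code_dim)
        dists = torch.cdist(flat, self.codebook.weight)  # (B*L, codebook_size)
        idx = dists.argmin(dim=-1)
        zq = self.codebook(idx).view_as(z)
        zq = z + (zq - z).detach()                       # straight-through gradients
        return zq, idx.view(z.shape[:-1])

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)   # assumed to return (B, L, feat_dim) grid features
        zq, idx = self.quantize(self.proj_in(feats))
        # Semantic reconstruction objective: align the tokenizer's outputs with
        # the frozen model's own representations to preserve semantic fidelity.
        # (A pixel-reconstruction decoder would be trained alongside; omitted here.)
        sem_loss = F.mse_loss(self.semantic_head(zq), feats)
        return zq, idx, sem_loss
```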
Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality while also enhancing token efficiency.
It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on the ImageNet benchmark, accelerating model convergence threefold, and enabling high-fidelity class-conditional synthesis without classifier-free guidance (CFG). The code will be released publicly to benefit the community.
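For context on the CFG-free claim, below is a hedged sketch of what class-conditional AR sampling looks like when no guidance mixing is applied: a single conditional forward pass per step, instead of the paired conditional and unconditional passes that CFG requires. The `ar_model` interface (returning per-position logits over the token vocabulary) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def sample_class_conditional(ar_model, class_id, seq_len, temperature=1.0, device="cpu"):
    # Begin with a single class-conditioning token. Because no CFG is used,
    # each step needs only one conditional forward pass -- there is no
    # unconditional branch to mix in, roughly halving sampling cost.
    tokens = torch.tensor([[class_id]], device=device)
    for _ in range(seq_len):
        logits = ar_model(tokens)[:, -1, :] / temperature  # next-token logits
        probs = torch.softmax(logits, dim=-1)
        tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)
    return tokens[:, 1:]  # discrete token indices; decode to pixels with the tokenizer
```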