https://github.com/TencentARC/Open-MAGVIT2




arxiv:2409.04410

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Published on Sep 6, 2024 · Submitted by AK on Sep 9, 2024

#3 Paper of the day

Authors: Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan

AI-generated summary

Open-MAGVIT2 models, ranging from 300M to 1.5B parameters, achieve state-of-the-art image reconstruction and explore auto-regressive generation with super-large token vocabularies.

Abstract

We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., 2^18 codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet 256 × 256. Furthermore, we explore its application in plain auto-regressive models and validate scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
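The asymmetric token factorization described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that an 18-bit token index is split into a 6-bit and a 12-bit sub-token; the abstract only says the two sub-vocabularies have different sizes, so the split, the constants, and the function names `factorize` and `merge` are hypothetical and not taken from the project's code.

```python
from typing import Tuple

# Assumed sizes for illustration only; the abstract does not state the split.
VOCAB_BITS = 18                    # codebook of 2**18 codes (from the abstract)
LOW_BITS = 12                      # hypothetical larger sub-vocabulary: 2**12 entries
HIGH_BITS = VOCAB_BITS - LOW_BITS  # hypothetical smaller sub-vocabulary: 2**6 entries


def factorize(token_id: int) -> Tuple[int, int]:
    """Split one token index into a (high, low) pair of sub-token indices."""
    if not 0 <= token_id < (1 << VOCAB_BITS):
        raise ValueError("token_id is outside the 2**18 codebook")
    high = token_id >> LOW_BITS              # top 6 bits
    low = token_id & ((1 << LOW_BITS) - 1)   # bottom 12 bits
    return high, low


def merge(high: int, low: int) -> int:
    """Recombine two sub-token indices into the original token index."""
    return (high << LOW_BITS) | low


if __name__ == "__main__":
    high, low = factorize(123456)
    print(high, low, merge(high, low))
    # Output sizes a model would predict over, under the assumed split:
    print((1 << HIGH_BITS) + (1 << LOW_BITS), "logits instead of", 1 << VOCAB_BITS)
```

Under this assumed split, a model would predict two sub-tokens over 64 + 4,096 = 4,160 classes in total rather than one token over 262,144 classes, which is the kind of reduction that makes a super-large vocabulary tractable for a plain auto-regressive model.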
