Paper page - "Principal Components" Enable A New Language of Images
Papers
arxiv:2503.08685

"Principal Components" Enable A New Language of Images

Published on Mar 11, 2025 · Submitted by Xin Wen on Mar 12, 2025
Authors: Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi

Abstract

AI-generated summary: A novel visual tokenization framework enhances reconstruction performance and interpretability by embedding a PCA-like structure into the latent token space, resolving semantic-spectrum coupling with a diffusion decoder.

We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space -- a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.
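
The PCA analogy in the abstract can be made concrete with a small sketch. The code below is not the paper's tokenizer: it runs classical PCA (power iteration with Gram-Schmidt deflation) on synthetic 3-D data, and every name in it (`principal_components`, `reconstruction_error`, the toy data) is a hypothetical choice for illustration. It shows the two properties the abstract borrows from PCA: each successive component explains no more variance than the one before it, and reconstruction error shrinks as more components are kept.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def covariance(rows):
    """Sample covariance matrix of mean-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def principal_components(cov, iters=500):
    """Orthonormal directions ordered by explained variance.

    Uses power iteration; each new direction is kept orthogonal to the
    earlier ones, so it captures only information they did not.
    """
    d = len(cov)
    rng = random.Random(1)
    comps, variances = [], []
    for _ in range(d):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            v = [dot(row, v) for row in cov]      # multiply by covariance
            for c in comps:                        # remove earlier directions
                proj = dot(v, c)
                v = [a - proj * b for a, b in zip(v, c)]
            norm = dot(v, v) ** 0.5
            v = [a / norm for a in v]
        comps.append(v)
        # Rayleigh quotient: variance of the data along direction v.
        variances.append(dot(v, [dot(row, v) for row in cov]))
    return comps, variances

def reconstruction_error(rows, comps, k):
    """Mean squared error when keeping only the first k components."""
    total = 0.0
    for r in rows:
        approx = [0.0] * len(r)
        for c in comps[:k]:
            coeff = dot(r, c)
            approx = [a + coeff * b for a, b in zip(approx, c)]
        total += sum((x - a) ** 2 for x, a in zip(r, approx))
    return total / len(rows)

# Synthetic data (an assumption for the demo): three correlated features
# driven by latent factors of very different scale.
rng = random.Random(0)
raw = []
for _ in range(300):
    z1, z2, z3 = rng.gauss(0, 3.0), rng.gauss(0, 1.0), rng.gauss(0, 0.3)
    raw.append([z1 + z2, z1 - z2, z3 + 0.5 * z1])
means = [sum(col) / len(raw) for col in zip(*raw)]
rows = [[x - m for x, m in zip(r, means)] for r in raw]

cov = covariance(rows)
comps, variances = principal_components(cov)
errors = [reconstruction_error(rows, comps, k) for k in range(len(comps) + 1)]

print("explained variance per component:", variances)
print("reconstruction error for k = 0..3:", errors)
```

In this toy setting the ordering falls out of the eigen-decomposition. The abstract's claim is that a learned 1D token sequence can be given an analogous ordering, which this sketch does not attempt to reproduce.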

