Paper page - PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
Community

zehongma: Codes are publicly available at https://github.com/Zehong-Ma/PixelGen.

Librarian Bot:
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* [One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation](https://huggingface.co/papers/2512.07829) (2025)
* [Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing](https://huggingface.co/papers/2512.17909) (2025)
* [One-step Latent-free Image Generation with Pixel Mean Flows](https://huggingface.co/papers/2601.22158) (2026)
* [REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion](https://huggingface.co/papers/2512.16636) (2025)
* [VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training](https://huggingface.co/papers/2601.17830) (2026)
* [DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation](https://huggingface.co/papers/2601.22904) (2026)
* [Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2601.16208) (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
Paper details

Authors: Zehong Ma, Ruihan Xu, Shiliang Zhang (Peking University)
Project page: https://zehong-ma.github.io/PixelGen/
Code: https://github.com/Zehong-Ma/PixelGen
AI-generated summary

PixelGen is a pixel-space diffusion framework that uses perceptual supervision through LPIPS and DINO-based losses to generate high-quality images without requiring VAEs or latent representations.

Abstract
Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at https://github.com/Zehong-Ma/PixelGen.
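To make the role of the two perceptual losses concrete, here is a minimal sketch of what a PixelGen-style training step could look like. This is not the authors' implementation: the rectified-flow parameterization, the choice of the `lpips` package (VGG backbone) and DINOv2 (`dinov2_vits14`) as feature extractors, and the loss weights `w_lpips` and `w_dino` are all illustrative assumptions; see the official repository for the actual code.

```python
# Hedged sketch of a PixelGen-style training step (assumptions noted above).
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_loss = lpips.LPIPS(net="vgg")  # local perceptual loss on patches
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # global semantics
dino.eval()
for p in dino.parameters():
    p.requires_grad_(False)

def pixelgen_step(model, x0, w_lpips=1.0, w_dino=1.0):
    """One training step: flow-matching MSE + LPIPS + DINO feature loss.

    x0: clean images in [-1, 1], shape (B, 3, H, W).
    model: any pixel-space denoiser taking (noisy image, timestep).
    """
    b = x0.size(0)
    t = torch.rand(b, 1, 1, 1, device=x0.device)   # random timestep in (0, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps                   # noisy pixels (rectified flow)
    v_target = eps - x0                            # velocity target dx_t/dt

    v_pred = model(x_t, t.flatten())
    x0_hat = x_t - t * v_pred                      # implied clean image

    loss_diff = F.mse_loss(v_pred, v_target)       # standard diffusion objective
    loss_lpips = lpips_loss(x0_hat, x0).mean()     # local perceptual supervision

    # DINOv2 expects ImageNet-normalized inputs at a multiple-of-14 resolution.
    def to_dino(x):
        x = (x + 1) / 2                            # [-1, 1] -> [0, 1]
        x = F.interpolate(x, size=(224, 224), mode="bilinear")
        mean = x.new_tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
        std = x.new_tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
        return (x - mean) / std

    with torch.no_grad():
        feat_real = dino(to_dino(x0))              # frozen target features
    feat_fake = dino(to_dino(x0_hat))
    loss_dino = F.mse_loss(feat_fake, feat_real)   # global semantic supervision

    return loss_diff + w_lpips * loss_lpips + w_dino * loss_dino
```

Note that both perceptual terms only need the implied clean image `x0_hat`, so in this formulation they bolt onto any x0- or velocity-predicting pixel diffusion objective without changing the sampler.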