In Pursuit of Pixel Supervision for Visual Pre-training
\n","updatedAt":"2025-12-18T18:20:23.721Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6785139441490173},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2512.15715","authors":[{"_id":"694370f5542d62d58a7bf6a0","name":"Lihe Yang","hidden":false},{"_id":"694370f5542d62d58a7bf6a1","user":{"_id":"64e3a7b8e12618b261fff364","avatarUrl":"/avatars/860a5011f150537eb923f76d354f1bc5.svg","isPro":false,"fullname":"Shang-Wen Daniel Li","user":"swdanielli","type":"user"},"name":"Shang-Wen Li","status":"claimed_verified","statusLastChangedAt":"2025-12-19T09:07:16.497Z","hidden":false},{"_id":"694370f5542d62d58a7bf6a2","name":"Yang Li","hidden":false},{"_id":"694370f5542d62d58a7bf6a3","name":"Xinjie Lei","hidden":false},{"_id":"694370f5542d62d58a7bf6a4","name":"Dong Wang","hidden":false},{"_id":"694370f5542d62d58a7bf6a5","name":"Abdelrahman Mohamed","hidden":false},{"_id":"694370f5542d62d58a7bf6a6","name":"Hengshuang Zhao","hidden":false},{"_id":"694370f5542d62d58a7bf6a7","name":"Hu Xu","hidden":false}],"publishedAt":"2025-12-17T18:59:58.000Z","submittedOnDailyAt":"2025-12-18T00:56:18.397Z","title":"In Pursuit of Pixel Supervision for Visual Pre-training","submittedOnDailyBy":{"_id":"65a3a0342548c41ad9f4e4e7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65a3a0342548c41ad9f4e4e7/80Yod9z7O95nC-Y0o-5UI.jpeg","isPro":false,"fullname":"Lihe Yang","user":"LiheYoung","type":"user"},"summary":"At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed \"Pixio\", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. 
Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.","upvotes":11,"discussionId":"694370f5542d62d58a7bf6a8","projectPage":"https://github.com/facebookresearch/pixio","githubRepo":"https://github.com/facebookresearch/pixio","githubRepoAddedBy":"user","ai_summary":"Pixel-space self-supervised learning using an enhanced masked autoencoder achieves competitive performance across diverse downstream tasks while maintaining simplicity and efficiency.","ai_keywords":["autoencoders","self-supervised learning","masked autoencoder","MAE","pixel space","latent space","downstream tasks","Depth Anything","MapAnything","semantic segmentation","robot learning","DINOv3"],"githubStars":347},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"65a3a0342548c41ad9f4e4e7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65a3a0342548c41ad9f4e4e7/80Yod9z7O95nC-Y0o-5UI.jpeg","isPro":false,"fullname":"Lihe Yang","user":"LiheYoung","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"65d86ea2685624d5f206d7ec","avatarUrl":"/avatars/b9bc4c398d5def393bc782e9a7c5e302.svg","isPro":false,"fullname":"Linshan Wu","user":"Luffy503","type":"user"},{"_id":"64e3a7b8e12618b261fff364","avatarUrl":"/avatars/860a5011f150537eb923f76d354f1bc5.svg","isPro":false,"fullname":"Shang-Wen Daniel Li","user":"swdanielli","type":"user"},{"_id":"640e26b03830fd441c2c106a","avatarUrl":"/avatars/8e457552ec454cbaa3ed9e138b29b47a.svg","isPro":false,"fullname":"Dong Wang","user":"dongwang218","type":"user"},{"_id":"62af665424488e6adfa9b8e2","avatarUrl":"/avatars/2bdb4a26fde4cbe5b4673e53e0d44540.svg","isPro":false,"fullname":"Edmond Jacoupeau","user":"edmond","type":"user"},{"_id":"664cb66017586a96342785c0","avatarUrl":"/avatars/a8fe303411c8c2f0bbd309b15a4c0026.svg","isPro":false,"fullname":"Wei Liu","user":"lefutonku","type":"user"},{"_id":"66e4165807c34507429cec2c","avatarUrl":"/avatars/a58d23f860b82cbb6aec99fbb3067fcb.svg","isPro":false,"fullname":"youdaoyzbx","user":"youdaoyzbx","type":"user"},{"_id":"63213080d2d45f3151837eba","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63213080d2d45f3151837eba/aBhKfY-0gZhKGQmb_Gwi2.png","isPro":true,"fullname":"Dan Jacobellis","user":"danjacobellis","type":"user"},{"_id":"65025370b6595dc45c397340","avatarUrl":"/avatars/9469599b176034548042922c0afa7051.svg","isPro":false,"fullname":"J C","user":"dark-pen","type":"user"},{"_id":"686db5d4af2b856fabbf13aa","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/6BjMv2LVNoqvbX8fQSTPI.png","isPro":false,"fullname":"V bbbb","user":"Bbbbbnnn","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
Pixel-space self-supervised learning using an enhanced masked autoencoder achieves competitive performance across diverse downstream tasks while maintaining simplicity and efficiency.
Abstract
At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while staying simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
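For readers unfamiliar with the masked-autoencoder objective that Pixio builds on, the sketch below illustrates the core idea: mask most image patches, then regress the raw pixels of the masked patches. This is a minimal PyTorch sketch of a generic MAE-style loss, not the Pixio implementation; the function names, the 0.75 mask ratio, and the plain per-patch MSE target are illustrative assumptions.

```python
# Minimal sketch of MAE-style pixel reconstruction (illustrative, not Pixio's code).
import torch


def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; return kept tokens and a binary mask.

    tokens: (B, N, D) patch embeddings. mask_ratio is an assumed default,
    following the original MAE paper.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)    # 1 = masked (to be reconstructed)
    mask.scatter_(1, ids_keep, 0.0)                  # 0 = visible to the encoder
    return kept, mask


def pixel_reconstruction_loss(pred: torch.Tensor, target: torch.Tensor,
                              mask: torch.Tensor) -> torch.Tensor:
    """Mean squared error in pixel space, averaged over masked patches only.

    pred, target: (B, N, patch_dim) predicted and ground-truth patch pixels.
    """
    loss = (pred - target) ** 2      # (B, N, patch_dim)
    loss = loss.mean(dim=-1)         # per-patch error
    return (loss * mask).sum() / mask.sum()
```

In this setup the encoder sees only the kept tokens, a lightweight decoder predicts pixels for every patch position, and the loss is computed solely on masked patches, so the pre-training signal comes entirely from pixel supervision.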