
arxiv:2602.11389

Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Authors: Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero

Published on Feb 11, 2026 · Submitted by Niels Rogge on Feb 18, 2026

Abstract

AI-generated summary

C-JEPA extends masked joint embedding prediction to object-centric representations, enabling robust relational understanding through object-level masking that induces causal inductive biases and improves reasoning and control tasks.

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
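
To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a JEPA-style objective with object-level (rather than patch-level) masking: entire object slots are hidden, so a masked object's latent must be predicted from the other objects, which is what the paper argues forces interaction reasoning. All names (SlotEncoder, predictor, mask_token), shapes, the loss choice, and the masking ratio are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
# Hypothetical sketch of object-level masked latent prediction (JEPA-style).
# Shapes, modules, and hyperparameters are assumptions for illustration only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D = 8, 6, 128  # batch size, number of object slots, latent dim (assumed)

class SlotEncoder(nn.Module):
    """Toy per-object encoder mapping object features to latent slots."""
    def __init__(self, in_dim=64, dim=D):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, object_features):          # (B, N, in_dim) -> (B, N, D)
        return self.proj(object_features)

encoder = SlotEncoder()
# JEPA-style target encoder: a frozen copy of the online encoder.
# (The usual EMA update of its weights is omitted from this sketch.)
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

# Predictor that infers masked objects' latents from the visible objects.
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2
)
mask_token = nn.Parameter(torch.zeros(1, 1, D))   # learned placeholder for masked slots

def object_masked_jepa_loss(object_features, mask_ratio=0.3):
    """One training step: hide whole object slots and regress their latents."""
    z_online = encoder(object_features)                      # (B, N, D)
    with torch.no_grad():
        z_target = target_encoder(object_features)           # prediction targets

    # Object-level masking: drop entire slots, not image patches.
    mask = torch.rand(object_features.shape[0], object_features.shape[1]) < mask_ratio

    # Replace masked objects' latents with the mask token (the "latent intervention").
    z_in = torch.where(mask.unsqueeze(-1), mask_token.expand_as(z_online), z_online)
    z_pred = predictor(z_in)                                  # (B, N, D)

    # Match predictions to target latents only at the masked object positions.
    if mask.any():
        return F.smooth_l1_loss(z_pred[mask], z_target[mask])
    return z_pred.sum() * 0.0

# Example usage with random features standing in for real object-centric inputs.
loss = object_masked_jepa_loss(torch.randn(B, N, 64))
loss.backward()
```

Because the predictor never sees a masked object's own latent, any correct prediction has to be reconstructed from the states of the other objects, which is the shortcut-prevention property the abstract emphasizes.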

Community

Paper submitter

https://github.com/galilai-group/cjepa

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model (2026): https://huggingface.co/papers/2602.10098
- Olaf-World: Orienting Latent Actions for Video World Modeling (2026): https://huggingface.co/papers/2602.10104
- LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation (2026): https://huggingface.co/papers/2602.07629
- Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models (2026): https://huggingface.co/papers/2602.01166
- Causal World Modeling for Robot Control (2026): https://huggingface.co/papers/2601.21998
- Evaluating Object-Centric Models beyond Object Discovery (2026): https://huggingface.co/papers/2602.07532
- Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments (2026): https://huggingface.co/papers/2601.01075

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.11389 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.11389 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.