VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Abstract
VideoJAM enhances generative video models by integrating motion coherence with appearance fidelity, achieving top performance in motion coherence and visual quality.
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
Community
VideoJAM is a generic framework for improving motion and physics in text-to-video (T2V) models.
It sets a new state of the art in motion generation and understanding, even though it was fine-tuned on only 3 million samples.
Project website: https://hila-chefer.github.io/videojam-paper.github.io/
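For readers who want a more concrete picture of the two complementary units described in the abstract, here is a minimal, PyTorch-style sketch. It is not the released implementation: the names (backbone, pixel_head, motion_head, guide_with_motion), the MSE losses, the choice of motion target (e.g., optical flow), and the guidance composition are all assumptions made for illustration.

```python
import torch.nn.functional as F

def joint_appearance_motion_loss(backbone, pixel_head, motion_head,
                                 noisy_video, timestep,
                                 pixel_target, motion_target,
                                 motion_weight=1.0):
    """Training-time sketch: one shared representation, two prediction heads."""
    features = backbone(noisy_video, timestep)       # single learned representation
    loss_pixels = F.mse_loss(pixel_head(features), pixel_target)    # appearance term
    loss_motion = F.mse_loss(motion_head(features), motion_target)  # motion term
    return loss_pixels + motion_weight * loss_motion

def inner_guidance_step(model, x_t, timestep, text_cond,
                        text_scale=7.5, motion_scale=2.0):
    """Inference-time sketch: reuse the model's own motion prediction as guidance."""
    eps_uncond = model(x_t, timestep, cond=None)
    eps_text = model(x_t, timestep, cond=text_cond)
    # Hypothetical flag: condition the denoiser on its own evolving motion prediction.
    eps_motion = model(x_t, timestep, cond=text_cond, guide_with_motion=True)
    # The composition below is an assumed multi-signal guidance rule,
    # not the paper's exact formula.
    return (eps_uncond
            + text_scale * (eps_text - eps_uncond)
            + motion_scale * (eps_motion - eps_text))
```

The point of the sketch is that both heads read the same features, so the motion loss shapes the representation that also produces the pixels, and at inference the motion branch is reused as a guidance signal instead of being discarded.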
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Large Motion Video Autoencoding with Cross-modal Video VAE (2024)
- Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation (2024)
- Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss (2025)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation (2025)
- DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency (2025)
- VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024)
- Motion-Aware Generative Frame Interpolation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Interesting
When can we try this model?
Never. Most models will be sold to the highest bidder so companies can extort exorbitant monthly fees from us.
Will it come with an API?
Any update?