SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
\n\n","updatedAt":"2025-09-19T04:54:53.647Z","author":{"_id":"62ebd791fee90fca4742ead8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62ebd791fee90fca4742ead8/8-iJYS9Wk41l6JTGtJrNf.jpeg","fullname":"levon dang","name":"levondang","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.6594488024711609},"editors":["levondang"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/62ebd791fee90fca4742ead8/8-iJYS9Wk41l6JTGtJrNf.jpeg"],"reactions":[],"isReport":false}},{"id":"6842fca1f2b666c22c8e9928","author":{"_id":"62ebd791fee90fca4742ead8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62ebd791fee90fca4742ead8/8-iJYS9Wk41l6JTGtJrNf.jpeg","fullname":"levon dang","name":"levondang","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false},"createdAt":"2025-06-06T14:35:13.000Z","type":"comment","data":{"edited":true,"hidden":true,"hiddenBy":"","hiddenReason":"Resolved","latest":{"raw":"This comment has been hidden","html":"This comment has been hidden","updatedAt":"2025-09-19T04:55:22.262Z","author":{"_id":"62ebd791fee90fca4742ead8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62ebd791fee90fca4742ead8/8-iJYS9Wk41l6JTGtJrNf.jpeg","fullname":"levon dang","name":"levondang","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"editors":[],"editorAvatarUrls":[],"reactions":[]}},{"id":"6843975d7a39c14c01a52f11","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-06-07T01:35:25.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. 
\n\nThe following papers were recommended by the Semantic Scholar API \n\n* [ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction](https://huggingface.co/papers/2504.21855) (2025)\n* [MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation](https://huggingface.co/papers/2505.10238) (2025)\n* [UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes](https://huggingface.co/papers/2505.12774) (2025)\n* [TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation](https://huggingface.co/papers/2504.08181) (2025)\n* [Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation](https://huggingface.co/papers/2504.14899) (2025)\n* [LatentMove: Towards Complex Human Movement Video Generation](https://huggingface.co/papers/2505.22046) (2025)\n* [DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation](https://huggingface.co/papers/2504.15032) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
\n","updatedAt":"2025-06-07T01:35:25.686Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6947861909866333},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2506.02444","authors":[{"_id":"6841bc292852c1d7d4ab7c43","user":{"_id":"62ebd791fee90fca4742ead8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62ebd791fee90fca4742ead8/8-iJYS9Wk41l6JTGtJrNf.jpeg","isPro":false,"fullname":"levon dang","user":"levondang","type":"user"},"name":"Lingwei Dang","status":"claimed_verified","statusLastChangedAt":"2025-06-05T17:08:01.386Z","hidden":true},{"_id":"6841bc292852c1d7d4ab7c44","name":"Ruizhi Shao","hidden":false},{"_id":"6841bc292852c1d7d4ab7c45","name":"Hongwen Zhang","hidden":false},{"_id":"6841bc292852c1d7d4ab7c46","name":"Wei Min","hidden":false},{"_id":"6841bc292852c1d7d4ab7c47","name":"Yebin Liu","hidden":false},{"_id":"6841bc292852c1d7d4ab7c48","name":"Qingyao Wu","hidden":false}],"publishedAt":"2025-06-03T05:04:29.000Z","submittedOnDailyAt":"2025-06-06T13:05:13.496Z","title":"SViMo: Synchronized Diffusion for Video and Motion Generation in\n Hand-object Interaction Scenarios","submittedOnDailyBy":{"_id":"62ebd791fee90fca4742ead8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62ebd791fee90fca4742ead8/8-iJYS9Wk41l6JTGtJrNf.jpeg","isPro":false,"fullname":"levon dang","user":"levondang","type":"user"},"summary":"Hand-Object Interaction (HOI) generation has significant application\npotential. However, current 3D HOI motion generation approaches heavily rely on\npredefined 3D object models and lab-captured motion data, limiting\ngeneralization capabilities. Meanwhile, HOI video generation methods prioritize\npixel-level visual fidelity, often sacrificing physical plausibility.\nRecognizing that visual appearance and motion patterns share fundamental\nphysical laws in the real world, we propose a novel framework that combines\nvisual priors and dynamic constraints within a synchronized diffusion process\nto generate the HOI video and motion simultaneously. To integrate the\nheterogeneous semantics, appearance, and motion features, our method implements\ntri-modal adaptive modulation for feature aligning, coupled with 3D\nfull-attention for modeling inter- and intra-modal dependencies. Furthermore,\nwe introduce a vision-aware 3D interaction diffusion model that generates\nexplicit 3D interaction sequences directly from the synchronized diffusion\noutputs, then feeds them back to establish a closed-loop feedback cycle. 
This\narchitecture eliminates dependencies on predefined object models or explicit\npose guidance while significantly enhancing video-motion consistency.\nExperimental results demonstrate our method's superiority over state-of-the-art\napproaches in generating high-fidelity, dynamically plausible HOI sequences,\nwith notable generalization capabilities in unseen real-world scenarios.\nProject page at https://github.com/Droliven/SViMo\\_project.","upvotes":2,"discussionId":"6841bc2d2852c1d7d4ab7d31","projectPage":"https://droliven.github.io/SViMo_project","githubRepo":"https://github.com/Droliven/SViMo_code","githubRepoAddedBy":"auto","ai_summary":"A framework combining visual priors and dynamic constraints within a synchronized diffusion process generates HOI video and motion simultaneously, enhancing video-motion consistency and generalization.","ai_keywords":["synchronized diffusion process","tri-modal adaptive modulation","3D full-attention","vision-aware 3D interaction diffusion model","HOI video generation","HOI motion generation"],"githubStars":16},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"62ebd791fee90fca4742ead8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62ebd791fee90fca4742ead8/8-iJYS9Wk41l6JTGtJrNf.jpeg","isPro":false,"fullname":"levon dang","user":"levondang","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

A framework combining visual priors and dynamic constraints within a synchronized diffusion process generates HOI video and motion simultaneously, enhancing video-motion consistency and generalization.
Abstract

Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches rely heavily on predefined 3D object models and lab-captured motion data, which limits their generalization. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often at the expense of physical plausibility. Recognizing that visual appearance and motion patterns obey the same physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate HOI video and motion simultaneously.
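To make the synchronized-diffusion idea concrete, here is a minimal, hypothetical sketch (not the authors' released code) of a two-branch DDPM-style sampling loop: video latents and motion latents share one noise schedule and one timestep, so the two modalities are denoised in lockstep. The joint denoiser and all tensor shapes are stand-ins.

```python
# Hedged sketch: synchronized denoising of video and motion latents.
import torch

T = 50                                  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)   # DDPM-style noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def ddpm_step(x, eps_pred, t):
    """One ancestral DDPM update, shared by both branches."""
    a, ab = alphas[t], alpha_bars[t]
    mean = (x - (1 - a) / torch.sqrt(1 - ab) * eps_pred) / torch.sqrt(a)
    if t > 0:
        mean = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
    return mean

def joint_denoiser(video_x, motion_x, t, text_emb):
    """Stub for the joint model (in SViMo, the tri-modal transformer)."""
    return torch.randn_like(video_x), torch.randn_like(motion_x)

video_x = torch.randn(1, 16, 4, 32, 32)   # (B, frames, C, H, W) video latents
motion_x = torch.randn(1, 16, 99)         # (B, frames, D) hand-object pose feats
text_emb = torch.randn(1, 77, 768)        # text condition

for t in reversed(range(T)):
    eps_v, eps_m = joint_denoiser(video_x, motion_x, t, text_emb)
    video_x = ddpm_step(video_x, eps_v, t)    # same timestep for both branches,
    motion_x = ddpm_step(motion_x, eps_m, t)  # keeping the modalities in sync
```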
To integrate heterogeneous semantics, appearance, and motion features, our method applies tri-modal adaptive modulation for feature alignment, coupled with 3D full-attention to model inter- and intra-modal dependencies.
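One plausible reading of this design is adaptive layer-norm modulation per modality, driven by a shared conditioning vector, followed by full self-attention over the concatenated token streams. The sketch below is an interpretation of the abstract with invented dimensions and module names; the real model's 3D full-attention runs over spatio-temporal video tokens, approximated here by flattening them into one sequence.

```python
# Hedged sketch: tri-modal adaptive modulation + joint full attention.
import torch
import torch.nn as nn

class TriModalBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # One (scale, shift) modulation per modality, from a shared condition
        # (e.g. pooled timestep + text embedding) -- names are hypothetical.
        self.mod = nn.ModuleDict(
            {m: nn.Linear(dim, 2 * dim) for m in ("text", "video", "motion")}
        )
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, video, motion, cond):
        tokens = []
        for name, x in (("text", text), ("video", video), ("motion", motion)):
            scale, shift = self.mod[name](cond).chunk(2, dim=-1)
            # Adaptive modulation aligns each modality before fusion.
            tokens.append(self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1))
        seq = torch.cat(tokens, dim=1)  # one sequence across all modalities
        # Full attention: every token attends to every other token, so inter-
        # and intra-modal dependencies are modeled by the same operation.
        out, _ = self.attn(seq, seq, seq)
        n_t, n_v = text.shape[1], video.shape[1]
        return out[:, :n_t], out[:, n_t:n_t + n_v], out[:, n_t + n_v:]

block = TriModalBlock()
t = torch.randn(2, 77, 256)    # text tokens
v = torch.randn(2, 512, 256)   # flattened spatio-temporal video tokens
m = torch.randn(2, 16, 256)    # motion tokens
c = torch.randn(2, 256)        # shared conditioning vector
t2, v2, m2 = block(t, v, m, c)
```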
Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs and feeds them back, establishing a closed-loop feedback cycle. This architecture eliminates dependence on predefined object models and explicit pose guidance while significantly enhancing video-motion consistency.
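The closed loop can be pictured as follows: a vision-aware interaction model maps the synchronized outputs to an explicit 3D interaction sequence, which is then injected as extra conditioning for another refinement pass. Everything below, including the toy feedback coupling, is an illustrative assumption rather than the paper's implementation.

```python
# Hedged sketch: closed-loop feedback between synchronized diffusion outputs
# and a vision-aware 3D interaction predictor (all names are stand-ins).
import torch
import torch.nn as nn

class VisionAware3DInteraction(nn.Module):
    """Stub: predicts explicit 3D interaction (e.g. hand joints + object pose)."""
    def __init__(self, vid_dim=128, mot_dim=99, out_dim=63):
        super().__init__()
        self.head = nn.Linear(vid_dim + mot_dim, out_dim)

    def forward(self, video_feat, motion_feat):
        return self.head(torch.cat([video_feat, motion_feat], dim=-1))

def synchronized_diffusion(text_emb, feedback=None):
    """Stub for the synchronized video+motion sampler; `feedback` is the
    explicit 3D interaction sequence used as extra conditioning."""
    video_feat = torch.randn(1, 16, 128)
    motion_feat = torch.randn(1, 16, 99)
    if feedback is not None:
        # Toy coupling for illustration only.
        motion_feat = motion_feat + feedback.mean(dim=-1, keepdim=True)
    return video_feat, motion_feat

interaction_model = VisionAware3DInteraction()
text_emb = torch.randn(1, 768)

feedback = None
for _ in range(2):  # a couple of feedback rounds
    video_feat, motion_feat = synchronized_diffusion(text_emb, feedback)
    # Explicit 3D interaction from the synchronized outputs...
    feedback = interaction_model(video_feat, motion_feat)
    # ...closes the loop by conditioning the next pass.
```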
Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven/SViMo_project.
TL;DR: A novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process for joint generation of video and motion in Hand-Object Interaction (HOI) scenarios.