Paper page - SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

    arxiv:2506.02444

    SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

    Published on Jun 3, 2025 · Submitted by levon dang on Jun 6, 2025
    Authors: Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu

    Abstract

    AI-generated summary

    A framework combining visual priors and dynamic constraints within a synchronized diffusion process generates HOI video and motion simultaneously, enhancing video-motion consistency and generalization.

    Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature aligning, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven/SViMo_project.
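    The abstract's core mechanism, per-modality adaptive modulation followed by joint attention over text, video, and motion tokens, can be pictured with a short, hedged sketch. This is not the authors' implementation: every module name, dimension, and the conditioning scheme below are illustrative assumptions only, standing in for the paper's tri-modal adaptive modulation and 3D full-attention.

    ```python
    # Minimal sketch (not the authors' code) of tri-modal adaptive modulation plus
    # joint attention over concatenated text/video/motion tokens. All names, sizes,
    # and the conditioning scheme are assumptions made for illustration.
    import torch
    import torch.nn as nn


    class AdaptiveModulation(nn.Module):
        """Predicts per-modality scale and shift from a shared condition embedding."""

        def __init__(self, dim: int, cond_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim, elementwise_affine=False)
            self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

        def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, num_tokens, dim); cond: (batch, cond_dim)
            scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
            return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


    class TriModalBlock(nn.Module):
        """One joint block: modality-specific modulation, then attention over the
        concatenated token sequence (a stand-in for the paper's 3D full-attention)."""

        def __init__(self, dim: int = 256, cond_dim: int = 256, heads: int = 8):
            super().__init__()
            self.mod = nn.ModuleDict(
                {m: AdaptiveModulation(dim, cond_dim) for m in ("text", "video", "motion")}
            )
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, feats: dict, cond: torch.Tensor) -> dict:
            lengths = {m: t.shape[1] for m, t in feats.items()}
            # Modulate each modality, then let all tokens attend to one another.
            x = torch.cat([self.mod[m](feats[m], cond) for m in feats], dim=1)
            x = x + self.attn(x, x, x, need_weights=False)[0]
            x = x + self.ff(x)
            # Split the joint sequence back into per-modality streams.
            out, start = {}, 0
            for m, n in lengths.items():
                out[m] = x[:, start : start + n]
                start += n
            return out


    if __name__ == "__main__":
        batch, cond = 2, torch.randn(2, 256)
        feats = {
            "text": torch.randn(batch, 16, 256),
            "video": torch.randn(batch, 128, 256),
            "motion": torch.randn(batch, 32, 256),
        }
        out = TriModalBlock()(feats, cond)
        print({m: tuple(t.shape) for m, t in out.items()})
    ```

    Running the script prints the per-modality output shapes, showing that all three streams are processed in a single attention pass; the closed-loop feedback between the synchronized diffusion and the 3D interaction diffusion model described in the abstract is not sketched here.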

    Community

    Paper author Paper submitter · edited Sep 19, 2025

    • Project page: https://github.com/Droliven/SViMo_project
    • Video demonstration: https://www.youtube.com/watch?v=pVkntn-8KHo

    Paper author Paper submitter

    This comment has been hidden (marked as Resolved)

    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

    The following papers were recommended by the Semantic Scholar API:

    • ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction (2025) - https://huggingface.co/papers/2504.21855
    • MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation (2025) - https://huggingface.co/papers/2505.10238
    • UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes (2025) - https://huggingface.co/papers/2505.12774
    • TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation (2025) - https://huggingface.co/papers/2504.08181
    • Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation (2025) - https://huggingface.co/papers/2504.14899
    • LatentMove: Towards Complex Human Movement Video Generation (2025) - https://huggingface.co/papers/2505.22046
    • DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation (2025) - https://huggingface.co/papers/2504.15032

    Please give a thumbs up to this comment if you found it helpful!

    If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


    Models citing this paper 0

    No model linking this paper

    Cite arxiv.org/abs/2506.02444 in a model README.md to link it from this page.

    Datasets citing this paper 0

    No dataset linking this paper

    Cite arxiv.org/abs/2506.02444 in a dataset README.md to link it from this page.

    Spaces citing this paper 0

    No Space linking this paper

    Cite arxiv.org/abs/2506.02444 in a Space README.md to link it from this page.

    Collections including this paper 1