    arxiv:2601.22153

    DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

    Published on Jan 29
    · Submitted by Haozhe Xie on Jan 30
    #3 Paper of the day
    Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu
    Abstract

    AI-generated summary: DynamicVLA addresses dynamic object manipulation challenges through a compact vision-language-action model with temporal reasoning and closed-loop adaptation, supported by a new benchmark for dynamic manipulation tasks.

    Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.
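
    The "Continuous Inference" design described above can be read as overlapping model inference with action execution, so the policy is already predicting the next action chunk while the robot is still executing the previous one. The sketch below is a minimal, hypothetical illustration of that scheduling idea; DummyPolicy, Observation, and all timings are placeholders assumed for illustration, not the authors' implementation.

    # Minimal sketch of overlapping inference and execution ("Continuous Inference"
    # as described in the abstract). All names and timings are illustrative.
    import threading
    import time
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Observation:
        timestamp: float
        frame_id: int          # stand-in for an RGB frame / proprioception


    class DummyPolicy:
        """Placeholder for the VLA model; inference takes non-zero time."""

        def predict_chunk(self, obs: Observation, horizon: int = 8) -> List[float]:
            time.sleep(0.05)   # simulated model latency
            return [obs.frame_id + k * 0.1 for k in range(horizon)]


    def continuous_inference_loop(policy: DummyPolicy, steps: int = 5) -> None:
        """Overlap inference for chunk t+1 with execution of chunk t."""
        obs = Observation(time.time(), frame_id=0)
        current_chunk = policy.predict_chunk(obs)

        for t in range(steps):
            next_chunk: List[List[float]] = []
            latest_obs = Observation(time.time(), frame_id=t + 1)

            # Start inference on the most recent observation in the background.
            worker = threading.Thread(
                target=lambda: next_chunk.append(policy.predict_chunk(latest_obs))
            )
            worker.start()

            # Execute the already-available chunk while inference runs.
            for action in current_chunk:
                time.sleep(0.01)          # simulated control-step duration
                _ = action                # send to the robot controller here

            worker.join()
            current_chunk = next_chunk[0]  # switch to the freshly predicted chunk


    if __name__ == "__main__":
        continuous_inference_loop(DummyPolicy())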

    Community

    Paper author · Paper submitter

    TL;DR: DynamicVLA enables open-ended dynamic object manipulation by pairing a compact 0.4B VLM with low-latency Continuous Inference and Latent-aware Action Streaming, evaluated at scale through the new DOM benchmark in both simulation and the real world.

    • Code: https://github.com/hzxie/DynamicVLA
    • Project Page: https://haozhexie.com/project/dynamic-vla
    • Spotlight Video: https://www.youtube.com/watch?v=NmJnHcI04_Q
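
    One way to picture the temporally aligned execution that Latent-aware Action Streaming aims for is to stamp each predicted action with the time it was planned for and drop actions that inference latency has already made stale. The sketch below is only an illustrative interpretation under that assumption, not the paper's algorithm; plan_chunk, TimedAction, and the timings are hypothetical.

    # Illustrative sketch of temporally aligned action execution: actions carry the
    # timestamp they were planned for, and stale ones are skipped so the executed
    # stream stays aligned with the observation it was predicted from.
    import time
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class TimedAction:
        target_time: float   # wall-clock time this action was planned for
        command: float       # stand-in for a joint / end-effector command


    def plan_chunk(obs_time: float, dt: float = 0.02, horizon: int = 16) -> List[TimedAction]:
        """Pretend the model planned `horizon` actions starting at the observation time."""
        return [TimedAction(obs_time + k * dt, command=float(k)) for k in range(horizon)]


    def execute_aligned(chunk: List[TimedAction]) -> int:
        """Skip actions whose planned time has already passed; execute the rest on schedule."""
        executed = 0
        for act in chunk:
            now = time.time()
            if act.target_time < now:
                continue                       # stale due to the perception-execution gap
            time.sleep(act.target_time - now)  # wait until the planned execution time
            _ = act.command                    # send to the controller here
            executed += 1
        return executed


    if __name__ == "__main__":
        obs_time = time.time()
        time.sleep(0.06)                       # simulated inference latency
        chunk = plan_chunk(obs_time)
        print(f"executed {execute_aligned(chunk)} of {len(chunk)} planned actions")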

    arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/dynamicvla-a-vision-language-action-model-for-dynamic-object-manipulation-4139-91b8768b

    • Executive Summary
    • Detailed Breakdown
    • Practical Applications

    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

    The following papers were recommended by the Semantic Scholar API:

    • PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention (https://huggingface.co/papers/2512.03724) (2025)
    • PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation (https://huggingface.co/papers/2601.07060) (2026)
    • Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling (https://huggingface.co/papers/2512.03044) (2025)
    • Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos (https://huggingface.co/papers/2512.13080) (2025)
    • ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance (https://huggingface.co/papers/2601.16667) (2026)
    • Robotic VLA Benefits from Joint Learning with Motion Image Diffusion (https://huggingface.co/papers/2512.18007) (2025)
    • Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding (https://huggingface.co/papers/2512.22519) (2025)

    Please give a thumbs up to this comment if you found it helpful!

    If you want recommendations for any paper on Hugging Face, check out the recommend_similar_papers Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


    Models citing this paper: 0 · Datasets citing this paper: 1 · Spaces citing this paper: 0 · Collections including this paper: 9