arxiv:2604.07209

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

Published on Apr 8

Submitted by taesiri on Apr 9

#3 Paper of the day

Abstract

INSPATIO-WORLD is a real-time framework for generating high-fidelity dynamic scenes from a single video, using a spatiotemporal autoregressive architecture and joint distribution matching distillation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often lack spatial persistence and visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic, interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: the Implicit Spatiotemporal Cache, which aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; and the Explicit Spatial Constraint Module, which enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD), which uses real-world data distributions as a regularizing guide to overcome the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
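To make the idea of translating user interactions into camera trajectories concrete, here is a minimal sketch. It is not the paper's Explicit Spatial Constraint Module: the command names, step sizes, pitch clamp, and axis convention are all assumptions chosen for illustration, and the module's geometric constraints are not modeled at all.

```python
import math
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Pose:
    """Camera pose: position plus yaw/pitch/roll in radians."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0


# Hypothetical constants, not taken from the paper.
MOVE_STEP = 0.1                 # translation per command
TURN_STEP = math.radians(5.0)   # rotation per command
MAX_PITCH = math.radians(80.0)  # clamp so the camera never flips over


def apply_command(pose: Pose, command: str) -> Pose:
    """Return the next pose after one user command (assumed command set)."""
    # Assumed convention: yaw = 0 faces +z, and turn_left rotates
    # the heading toward +x.
    if command in ("forward", "backward"):
        sign = 1.0 if command == "forward" else -1.0
        return replace(
            pose,
            x=pose.x + sign * MOVE_STEP * math.sin(pose.yaw),
            z=pose.z + sign * MOVE_STEP * math.cos(pose.yaw),
        )
    if command in ("turn_left", "turn_right"):
        sign = 1.0 if command == "turn_left" else -1.0
        return replace(pose, yaw=pose.yaw + sign * TURN_STEP)
    if command in ("look_up", "look_down"):
        sign = 1.0 if command == "look_up" else -1.0
        pitch = pose.pitch + sign * TURN_STEP
        return replace(pose, pitch=max(-MAX_PITCH, min(MAX_PITCH, pitch)))
    raise ValueError(f"unknown command: {command}")


def trajectory(start: Pose, commands: Iterable[str]) -> List[Pose]:
    """Integrate a command sequence into a list of poses, one per step."""
    poses = [start]
    for command in commands:
        poses.append(apply_command(poses[-1], command))
    return poses


# Three steps forward, a 90-degree turn in 5-degree increments, one more step.
path = trajectory(Pose(), ["forward"] * 3 + ["turn_left"] * 18 + ["forward"])
final = path[-1]
print(f"{len(path)} poses, final x={final.x:.2f}, z={final.z:.2f}")
```

Bounding each command to a small fixed step is one simple way to keep the resulting path smooth. A real system would presumably also check the path against scene geometry, which this sketch leaves out.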


Community

avahal

Apr 9

the explicit spatial constraint module that converts user inputs into precise 6-DoF camera trajectories, paired with the implicit spatiotemporal cache, is a clever design for stabilizing long-horizon geometry. the way jdmd balances motion fidelity via controllable video rerendering with photorealism guided by real-world priors feels like a practical fix for the synthetic-to-real gap. the arxivlens breakdown helped me parse how the sliding memory anchors to reference frames while still letting autoregressive updates be responsive. i’d love an ablation isolating the explicit constraint’s contribution on longer sequences to see where drift actually starts.
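The sliding memory that stays anchored to reference frames, as described in the comment above, can be pictured with a toy data structure. This is only a sketch under assumptions: the class name, the string stand-ins for latents, and the fixed window size are hypothetical, and the actual cache operates on learned latent features rather than on labels.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class AnchoredCache:
    """Toy cache: reference entries are pinned, history entries slide."""
    max_history: int
    reference: List[str] = field(default_factory=list)
    history: Deque[str] = field(default_factory=deque)

    def add_reference(self, latent: str) -> None:
        # Reference entries are never evicted, so they keep anchoring
        # the context however long the rollout runs.
        self.reference.append(latent)

    def add_generated(self, latent: str) -> None:
        # Generated entries live in a bounded window; the oldest is dropped.
        self.history.append(latent)
        while len(self.history) > self.max_history:
            self.history.popleft()

    def context(self) -> List[str]:
        # Context for the next autoregressive step: anchors, then recent history.
        return self.reference + list(self.history)


cache = AnchoredCache(max_history=3)
for name in ("ref0", "ref1"):
    cache.add_reference(name)
for step in range(5):
    cache.add_generated(f"gen{step}")

print(cache.context())  # ['ref0', 'ref1', 'gen2', 'gen3', 'gen4']
```

After five generated steps the two reference entries are still present while only the three most recent generated entries remain. That captures the trade-off the comment points at: a fixed anchor for consistency plus a short, fresh window for responsiveness.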