The field of video generation is undergoing a paradigm shift: from generating realistic and appealing visuals to constructing world models that can simulate interactive and navigable environments. These models are not just visual tools; they serve as testbeds for training and evaluating intelligent agents, such as robots, autonomous vehicles, or virtual avatars. A central goal is to enable agents to perceive, act, and plan within generated video scenarios as if they were interacting with the real world. We compile key works that push video generation toward actionable world modeling, focusing on physical plausibility and the capacity for agents to navigate, manipulate, and learn from these synthetic environments.
This repository currently contains the paper list for "Video Generation towards World Model".
We hope to support the research and industrial communities by systematically collecting and organizing influential works that drive progress in video generation for world modeling.
- [06/2025] We are hosting the CVPR 2025 tutorial "From Video Generation to World Model" on June 11!
This repository is updated periodically. If you have suggestions for additional resources, updates on methodologies, or fixes for expiring links, please feel free to do any of the following:
- raise an Issue,
- nominate awesome related works with Pull Requests,
- for other queries, email both Ziqi (ZIQI002 at e dot ntu dot edu dot sg) and Jingtong (yuejingtong137 at gmail dot com).
1. Generation 1: Faithfulness - Accurate Simulation of the Real World
2. Generation 2: Interactiveness - Controllability and Interactive Dynamics
3. Generation 3: Planning - Modeling the Future Evolution of Complex Systems
- MoStGAN-V: Video Generation with Temporal Motion Styles (2023-04-05)
- Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks (2022-02-21)
- StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 (2021-12-29)
- Temporal Shift GAN for Large Scale Video Generation (2020-04-04)
- Adversarial Video Generation on Complex Datasets (2019-07-15)
- IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-video Generation (2019-03-07)
- To Create What You Tell: Generating Videos from Captions (2018-04-23)
- MoCoGAN: Decomposing Motion and Content for Video Generation (2017-07-17)
- Temporal Generative Adversarial Nets with Singular Value Clipping (2016-11-21)
- Generating Videos with Scene Dynamics (2016-09-08)
- Temporal texture modeling (2002-08-06)
- ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model (2024-11-24)
- MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators (2024-04-05)
- AnimateLCM: Computation-Efficient Personalized Style Video Generation without Personalized Video Data (2024-02-01)
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos (2023-12-25)
- InstructVideo: Instructing Video Diffusion Models with Human Feedback (2023-12-19)
- DreamVideo: Composing Your Dream Videos with Customized Subject and Motion (2023-12-07)
- Make Pixels Dance: High-Dynamic Video Generation (2023-11-18)
- Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning (2023-11-17)
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (2023-10-23)
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs (2023-08-26)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation (2023-08-18)
- Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising (2023-05-29)
- Any-to-Any Generation via Composable Diffusion (2023-05-19)
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models (2023-04-18)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation (2023-03-15)
- Structure and Content-Guided Video Synthesis with Diffusion Models (2023-02-06)
- Flexible Diffusion Modeling of Long Videos (2022-05-23)
- MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation (2022-05-19)
- Training-Free Efficient Video Generation via Dynamic Token Carving (2025-05-22)
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset (2025-03-25)
- IPO: Iterative Preference Optimization for Text-to-Video Generation (2025-02-04)
- VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024-12-21)
- Allegro: Open the black box of commercial-level video generation model (2024-10-20)
- Movie Gen: A Cast of Media Foundation Models (2024-10-17)
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
- Pyramidal Flow Matching for Efficient Video Generative Modeling (2024-10-08)
- Emu3: Next-Token Prediction is All You Need (2024-09-27)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations (2024-08-22)
- T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback (2024-05-29)
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers (2024-01-16)
- Photorealistic Video Generation with Diffusion Models (2023-12-11)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models (2024-12-10)
- DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion (2024-11-07)
- Loong: Generating Minute-level Long Videos with Autoregressive Language Models (2024-10-03)
- VideoTetris: Towards Compositional Text-to-Video Generation (2024-06-06)
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (2024-03-21)
- VideoPoet: A Large Language Model for Zero-Shot Video Generation (2023-12-21)
- Generative Multimodal Models are In-Context Learners (2023-12-20)
Geometry Condition
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions (2024-01-03)
- SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models (2023-11-28)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation (2023-10-11)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation (2023-07-13)
- VideoComposer: Compositional Video Synthesis with Motion Controllability (2023-06-03)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning (2023-05-23)
- ControlVideo: Training-free Controllable Text-to-Video Generation (2023-05-22)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators (2023-03-23)
- Adding Conditional Control to Text-to-Image Diffusion Models (2023-02-10)
3D Condition
- Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models (2024-05-26)
- VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model (2024-03-18)
- SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion (2024-03-18)
- V3D: Video Diffusion Models are Effective 3D Generators (2024-03-11)
Physics Condition
- Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback (2024-12-03)
- PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation (2024-09-27)
Trajectory Navigation
- Motion Prompting: Controlling Video Generation with Motion Trajectories (2024-12-03)
- SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation (2024-11-07)
- FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models (2024-06-24)
- Image Conductor: Precision Control for Interactive Video Synthesis (2024-06-21)
- DragAnything: Motion Control for Anything using Entity Representation (2024-03-12)
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions (2024-02-05)
- Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion (2024-02-05)
- TrailBlazer: Trajectory Control for Diffusion-Based Video Generation (2023-12-21)
- PEEKABOO: Interactive Video Generation via Masked-Diffusion (2023-12-12)
- MotionCtrl: A Unified and Flexible Motion Controller for Video Generation (2023-12-06)
- Fine-grained Controllable Video Generation via Object Appearance and Context (2023-12-05)
Camera Motion Navigation
- FullDiT: Multi-Task Video Generative Foundation Model with Full Attention (2025-03-25)
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video (2025-03-14)
- CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation (2025-02-12)
- AKiRa: Augmentation Kit on Rays for optical video generation (2024-12-18)
- SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints (2024-12-10)
- AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers (2024-11-27)
- ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning (2024-11-07)
- Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention (2024-10-14)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control (2024-07-17)
- CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation (2024-06-04)
- Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control (2024-05-27)
- Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis (2024-05-23)
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation (2024-04-02)
Instruction Navigation
Action Navigation
- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (2024-01-18)
Action Navigation
- iVideoGPT: Interactive VideoGPTs are Scalable World Models (2024-05-24)
- Model-Based Reinforcement Learning for Atari (2019-03-01)
Instruction Navigation
Goal Navigation
- Robot Motion Planning as Video Prediction: A Spatio-Temporal Neural Network-based Motion Planner (2022-12-26)
Hybrid Navigation
Layout Condition
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention (2024-12-04)
- MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control (2024-09-10)
- DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation (2024-09-09)
- Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation (2024-06-03)
- SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control (2024-03-28)
- Panacea: Panoramic and Controllable Video Generation for Autonomous Driving (2023-11-28)
- MagicDrive: Street View Generation with Diverse 3D Geometry Control (2023-10-04)
Instruction Navigation
Action Navigation
- WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation (2023-12-05)
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving (2023-11-29)
- DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving (2023-09-18)
- Model-Based Imitation Learning for Urban Driving (2022-10-14)
- Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model (2022-10-08)
- Iso-Dream: Isolating and Leveraging Noncontrollable Visual Dynamics in World Models (2022-05-27)
- DriveGAN: Towards a Controllable High-Quality Neural Simulation (2021-04-30)
- Learning a Driving Simulator (2016-08-03)
Hybrid Navigation
Other Navigation
- Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception (2025-03-17)
Controller Navigation
- Video2 Game Generation: A Practical Study using Mario (2024)
- Playable Video Generation (2021-01-28)
- Learning to Simulate Dynamic Environments with GameGAN (2020-05-25)
Action Navigation
- Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models (2023-03-23)
- Transformer-based World Models Are Happy With 100k Interactions (2023-03-13)
- Mastering Diverse Domains through World Models (2023-01-10)
- Learning General World Models in a Handful of Reward-Free Deployments (2022-10-23)
- Transformers are Sample-Efficient World Models (2022-09-05)
- Playable Environments: Video Manipulation in Space and Time (2022-03-03)
- Mastering Atari with Discrete World Models (2020-10-05)
- Dream to Control: Learning Behaviors by Latent Imagination (2019-12-03)
- Recurrent Environment Simulators (2017-04-07)
Hybrid Navigation
- SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents (2026-03-09)
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling (2025-07-10)
- VMoBA: Mixture-of-Block Attention for Video Diffusion Models (2025-06-30)
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction (2025-03-25)
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models (2025-02-04)
World Model Regulation Methods
Stabilization:
- Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention (2026-01-21)
Inference-time Physics Alignment:
Efficiency:
Planning Optimization:
Long Video Generation Methods
- LIVE: Long-horizon Interactive Video World Modeling (2026-02-03)
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation (2025-12-08)
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework (2025-12-02)
- VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning (2025-10-29)
- SketchVideo: Sketch-based Video Generation and Editing (2025-03-30)
- Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields (2026-01-29)
- LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion (2025-07-03)
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control (2025-01-07)
- GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking (2025-01-05)
- World-consistent Video Diffusion with Explicit 3D Modeling (2024-12-02)
- Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals (2026-01-09)
- PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning (2025-10-15)
- WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation (2025-03-11)
- Synthetic Video Enhances Physical Fidelity in Video Synthesis (2025-03-26)
- PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop (2025-03-12)
- PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation (2024-11-30)
- InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing (2025-08-19)
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency (2024-09-04)
- The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text (2025-12-18)
- Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise (2025-01-14)
- 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation (2024-12-10)
- VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control (2026-01-08)
- AirScape: An Aerial Generative World Model with Motion Controllability (2025-07-10)
- CamCloneMaster: Enabling Reference-based Camera Control for Video Generation (2025-06-03)
- Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval (2025-06-03)
- TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation (2025-04-11)
- CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models (2025-03-13)
- GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control (2025-03-05)
- Training-free Camera Control for Video Generation (2024-06-14)
- UniVideo: Unified Understanding, Generation, and Editing for Videos (2025-10-09)
- Video World Models with Long-term Spatial Memory (2025-06-05)
- SlowFast-VGen: Slow-Fast Learning for Action-driven Long Video Generation (2024-10-30)
- Pandora: Towards general world model with natural language actions and video states (2024-06-12)
- Grounding World Simulation Models in a Real-World Metropolis (2026-03-16)
- WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions (2025-05-23)
- Introducing Multiverse: The First AI Multiplayer World Model (2025-05-08)
- Aether: Geometric-Aware Unified World Modeling (2025-03-24)
- Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation (2026-03-17)
- Interactive World Simulator for Robot Policy Training and Evaluation (2026-03-09)
- BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks (2026-02-03)
- Walk through Paintings: Egocentric World Models from Internet Priors (2026-01-21)
- Aerial World Model for Long-horizon Visual Generation and Navigation in 3D Space (2025-12-26)
- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
- EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling (2025-07-07)
- WorldVLA: Towards Autoregressive Action World Model (2025-06-26)
- Consistent World Models via Foresight Diffusion (2025-05-22)
- Learning 3D Persistent Embodied World Models (2025-05-05)
- Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control (2025-03-18)
- Unified Video Action Model (2025-02-28)
- Pre-Trained Video Generative Models as World Simulators (2025-02-10)
- Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression (2025-02-06)
- Prediction with Action: Visual Policy Learning via Joint Denoising Process (2024-11-27)
- IRASim: Learning Interactive Real-Robot Action Simulators (2024-06-20)
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning (2026-01-22)
- ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models (2026-01-18)
- InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation (2026-01-05)
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation (2024-11-14)
- EVA: An Embodied World Model for Future Video Anticipation (2024-10-20)
- VideoAgent: Self-Improving Video Generation (2024-10-14)
- COMBO: Compositional World Models for Embodied Multi-Agent Cooperation (2024-04-16)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training (2024-02-22)
- Video Language Planning (2023-10-16)
- Learning Interactive Real-World Simulators (2023-10-09)
- Compositional Foundation Models for Hierarchical Planning (2023-09-15)
- Grounding Video Models to Actions through Goal Conditioned Exploration (2024-11-11)
- Learning to Act from Actionless Videos through Dense Correspondences (2023-10-12)
- Visuo-Tactile World Models (2026-02-05)
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge (2025-07-06)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control (2025-06-02)
- NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants (2025-02-19)
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (2023-12-20)
- Evaluating Robot Policies in a World Model (2025-05-31)
- Learning Interactive Real-World Simulators (2023-10-09)
- DriveFix: Spatio-Temporally Coherent Driving Scene Restoration (2026-03-17)
- ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask (2026-02-03)
- UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving (2026-02-02)
- Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models (2025-06-10)
- CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving (2025-03-28)
- MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving (2025-03-20)
- UniScene: Unified Occupancy-centric Driving Scene Generation (2024-12-06)
- DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes (2024-09-06)
- DiVE: DiT-based Video Generation with Enhanced Control (2024-09-03)
- DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model (2023-10-11)
- InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation (2026-02-03)
- UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning (2026-02-02)
- DISK: Dynamic Inference SKipping for World Models (2026-01-31)
- MAD: Motion Appearance Decoupling for efficient Driving World Models (2026-01-14)
- Epona: Autoregressive Diffusion World Model for Autonomous Driving (2025-06-30)
- DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT (2024-12-27)
- GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control (2024-12-15)
- Doe-1: Closed-Loop Autonomous Driving with Large World Model (2024-12-12)
- ACT-BENCH: Towards Action Controllable World Models for Autonomous Driving (2024-12-06)
- MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control (2024-11-21)
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes (2024-05-23)
- Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving (2026-01-29)
- DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers (2024-12-24)
- InfinityDrive: Breaking Time Limits in Driving World Models (2024-12-02)
- GAIA-1: A Generative World Model for Autonomous Driving (2023-09-29)
- SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving (2026-03-09)
- GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving (2025-03-26)
- MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction (2025-02-17)
- Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability (2024-05-27)
- Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models (2025-07-17)
- Physical Informed Driving World Model (2024-12-11)
- WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation (2026-03-17)
- Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory (2026-02-03)
- The World's First AI-Native UGC Game Engine Powered by Real-Time World Model (2025-07-03)
- Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition (2025-06-20)
- WORLDMEM: Long-term Consistent World Simulation with Memory (2025-04-16)
- MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft (2025-04-11)
- Model as a Game: On Numerical and Spatial Consistency for Generative Games (2025-03-27)
- World and Human Action Models towards gameplay ideation (2025-02-19)
- GameFactory: Creating New Games with Generative Interactive Videos (2025-01-14)
- Genie 2: A large-scale foundation world model (2024-12-04)
- The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control (2024-12-04)
- Playable Game Generation (2024-12-01)
- GameGen-X: Interactive Open-world Game Video Generation (2024-11-01)
- Oasis: A Universe in a Transformer (2024-10-31)
- Diffusion Models Are Real-Time Game Engines (2024-08-27)
- Diffusion for World Modeling: Visual Details Matter in Atari (2024-05-20)
- Genie: Generative Interactive Environments (2024-02-23)
- Accurate and Efficient World Modeling with Masked Latent Transformers (2025-07-05)
- Long-Context State-Space Video World Models (2025-05-26)
- DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (2024-11-07)
| Date | Venue | Acronym | Paper | Project | Repo@GitHub |
|---|---|---|---|---|---|
| 2026-01-28 | arXiv | LingBot-World | Advancing Open-source World Models | | |
| 2025-12-26 | arXiv | Yume1.5 | Yume1.5: A Text-Controlled Interactive World Generation Model | | |
| 2025-12-16 | arXiv | WorldPlay | WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling | | |
| 2025-12-09 | ICLR26 | Astra | Astra: General Interactive World Model with Autoregressive Denoising | | |
| 2025-07-17 | | MirageLSD | MirageLSD: The First Live-Stream Diffusion AI Video Model | | |
| 2025-06-11 | arXiv | V-JEPA 2 | V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | | |
| 2024-12-04 | CVPR25 | NWM | Navigation World Models | | |
For Robotics:
- Causal World Modeling for Robot Control (2026-01-29)
- An Efficient and Multi-Modal Navigation System with One-Step World Model (2026-01-18)
Note: Action and goal navigation for robotics.
- WonderZoom: Multi-Scale 3D World Generation (2025-12-09)
- Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding (2025-07-20)
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness (2025-03-27)
- FullDiT: Multi-Task Video Generative Foundation Model with Full Attention (2025-03-25)
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation (2025-03-09)
- VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models (2024-11-20)
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (2024-07-19)
- VideoPhy: Evaluating Physical Commonsense for Video Generation (2024-06-05)
- VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)
- FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation (2023-11-03)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models (2023-10-17)
- WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models (2026-02-09)
- WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models (2026-01-29)
- Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models (2026-01-27)
- DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving (2026-01-04)
- Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test (2026-01-07)
- Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation (2025-06-27)
- Evaluating Robot Policies in a World Model (2025-05-31)
- WorldScore: A Unified Evaluation Benchmark for World Generation (2025-04-01)
- WorldModelBench: Judging Video Generation Models As World Models (2025-02-28)
- WorldSimBench: Towards Video Generation Models as World Simulators (2024-10-23)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation (2024-10-07)
- Action100M: A Large-scale Video Action Dataset (2026-01-15)
- VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing (2024-11-22)
- Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content (2024-10-10)
- VidGen-1M: A Large-Scale Dataset for Text-to-video Generation (2024-08-05)
- MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (2024-07-08)
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2024-02-29)
- Consistent Video-to-Video Transfer Using Synthetic Dataset (2023-11-01)
- Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions (2021-11-19)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (2021-04-01)
- HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (2019-06-07)
- VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (2019-04-06)
- Localizing Moments in Video with Natural Language (2017-08-04)
- Towards Automatic Learning of Procedures from Web Instructional Videos (2017-03-28)
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (2016-12-12)
- ActivityNet: A large-scale video benchmark for human activity understanding (2015-10-15)
- A Dataset for Movie Description (2015-01-12)
- From Perception to Action: Spatial AI Agents and World Models (2026-02-02)
- A Mechanistic View on Video Generation as World Models: State and Dynamics (2026-01-22)
- From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models (2026-01-21)
- Video Generation Models in Robotics - Applications, Research Challenges, Future Directions (2026-01-12)
- Modeling the Mental World for Embodied AI: A Comprehensive Review (2025-12-17)
- 3D and 4D World Modeling: A Survey (2025-09-04)
- Reconstructing 4D Spatial Intelligence: A Survey (2025-07-28)
- A Survey: Learning Embodied Intelligence from Physical Simulators and World Models (2025-07-01)
- A Survey of Interactive Generative Video (2025-04-30)
- Survey of Video Diffusion Models: Foundations, Implementations, and Applications (2025-04-22)
- Exploring the Evolution of Physics Cognition in Video Generation: A Survey (2025-03-27)
- Generative Physical AI in Vision: A Survey (2025-01-19)
- Joint Perception and Prediction for Autonomous Driving: A Survey (2024-12-18)
- Understanding World or Predicting Future? A Comprehensive Survey of World Models (2024-11-21)
- Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey (2024-11-05)
- From Sora What We Can See: A Survey of Text-to-Video Generation (2024-05-17)
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (2024-05-06)
- Video Diffusion Models: A Survey (2024-05-06)
- Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation (2024-03-08)
- World Models for Autonomous Driving: An Initial Survey (2024-03-05)
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (2024-02-27)
- A Survey on Video Diffusion Models (2023-10-16)
- Video Generative Adversarial Networks: A Review (2020-11-04)
- Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks (2026-02-02)
- An Empirical Study of World Model Quantization (2026-02-02)
- Learning Latent Action World Models In The Wild (2026-01-08)
- General agents need world models (2025-06-02)
- Position: Interactive Generative Video as Next-Generation Game Engine (2025-03-21)
- How Far is Video Generation from World Model: A Physical Law Perspective (2024-11-04)
- Video as the New Language for Real-World Decision Making (2024-02-27)
- A Path Towards Autonomous Machine Intelligence (2022-06-07)
- DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models (2026-03-17)
- NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation (2026-03-16)
- World-Gymnast: Training Robots with Reinforcement Learning in a World Model (2026-02-02)
- World Models as an Intermediary between Agents and the Real World (2026-01-31)
- TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion (2026-01-26)
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video (2026-01-24)
If you find this paper useful, please consider citing:
@article{yue2025video,
title={Simulating the World Model with Artificial Intelligence: A Roadmap},
author={Jingtong Yue and Ziqi Huang and Zhaoxi Chen and Xintao Wang and Pengfei Wan and Ziwei Liu},
journal={arXiv preprint arXiv:2511.08585},
year={2025}
}