Awesome From Video Generation to World Model

The field of video generation is undergoing a paradigm shift: from generating realistic and appealing visuals to constructing world models that can simulate interactive and navigable environments. These models are not just visual tools; they serve as testbeds for training and evaluating intelligent agents, such as robots, autonomous vehicles, or virtual avatars. A central goal is to enable agents to perceive, act, and plan within generated video scenarios as if they were interacting with the real world. We compile key works that push video generation toward actionable world modeling, focusing on physical plausibility and the capacity for agents to navigate, manipulate, and learn from these synthetic environments.

Overview

This repository currently contains the paper list for "Video Generation towards World Model".

What You'll Find Here

We hope to support the research and industrial communities by systematically collecting and organizing influential works that drive progress in video generation for world modeling.

News 🔥

[06/2025] We are hosting the CVPR 2025 tutorial "From Video Generation to World Model" on June 11!

Updates

This repository is updated periodically. If you have suggestions for additional resources, updates on methodologies, or fixes for expiring links, please feel free to do any of the following:

Raise an Issue.
Nominate awesome related works with Pull Requests.
For other queries, email both Ziqi (ZIQI002 at e dot ntu dot edu dot sg) and Jingtong (yuejingtong137 at gmail dot com).

1. Generation 1: Faithfulness - Accurate Simulation of the Real World

1.1 Video Foundation Model

Date	Venue	Acronym	Paper
2025-03-04	Arxiv	Helios	Helios: Real Real-Time Long Video Generation Model
2024-12-30	Arxiv	LTX-Video	LTX-Video: Realtime Video Latent Diffusion
2024-12-12	Arxiv	Owl-1	Owl-1: Omni World Model for Consistent Long Video Generation
2024-12-10	Arxiv	STIV	STIV: Scalable Text and Image Conditioned Video Generation
2024-09-24		JT-CV
2024-09		Hailuo AI
2024-06-06		VideoTetris	VideoTetris: Towards Compositional Text-to-Video Generation
2024-02-22		Snap Video	Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
2024-01-23	Arxiv	Lumiere	Lumiere: A Space-Time Diffusion Model for Video Generation
2024-01-17	CVPR24	VideoCrafter2	VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
2024-01-09	Arxiv	MagicVideo-V2	MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
2024-01-05	TMLR25	Latte	Latte: Latent Diffusion Transformer for Video Generation
2023-12-07	Arxiv	HiGen	Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
2023-11-25	Arxiv	SVD	Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
2023-11-07	Arxiv	I2VGen-XL	I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
2023-10-31	ICLR24	SEINE	SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
2023-10-30	Arxiv	VideoCrafter1	VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
2023-10-18	ECCV24	DynamiCrafter	DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
2023-10-09	ICLR24	MAGVIT-v2	Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
2023-09-27	IJCV24	Show-1	Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video
2023-09-26	IJCV24	LaVie	LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
2023-09-01	Arxiv	VideoGen	VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
2023-08-12	Arxiv	ModelScope	ModelScope Text-to-Video Technical Report
2023-07-10	ICLR24	AnimateDiff	AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
2023-06-29	Arxiv	Pika
2023-06-07		Gen-2
2023-02		Gen-1	Gen-1: The Next Step Forward for Generative AI
2022-12-10	CVPR23	MAGVIT	MAGVIT: Masked Generative Video Transformer
2022-11-20	Arxiv	MagicVideo	MagicVideo: Efficient Video Generation With Latent Diffusion Models
2022-10-05	Arxiv	Imagen Video	Imagen Video: High Definition Video Generation with Diffusion Models
2022-09-29	Arxiv	Make-A-Video	Make-A-Video: Text-to-Video Generation without Text-Video Data
2022-05-29	ICLR23	CogVideo	CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

1.2 Other Video Generation Model

1.3 Conditioned World Model

1.3.1 Conditioned World Model in General Scene

Geometry Condition

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions (2024-01-03)
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models (2023-11-28)
ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation (2023-10-11)
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation (2023-07-13)
VideoComposer: Compositional Video Synthesis with Motion Controllability (2023-06-03)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning (2023-05-23)
ControlVideo: Training-free Controllable Text-to-Video Generation (2023-05-22)
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators (2023-03-23)
Adding Conditional Control to Text-to-Image Diffusion Models (2023-02-10)

3D Condition

Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models (2024-05-26)
VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model (2024-03-18)
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion (2024-03-18)
V3D: Video Diffusion Models are Effective 3D Generators (2024-03-11)

Physics Condition

Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback (2024-12-03)
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation (2024-09-27)

Trajectory Navigation

Motion Prompting: Controlling Video Generation with Motion Trajectories (2024-12-03)
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation (2024-11-07)
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models (2024-06-24)
Image Conductor: Precision Control for Interactive Video Synthesis (2024-06-21)
DragAnything: Motion Control for Anything using Entity Representation (2024-03-12)
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions (2024-02-05)
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion (2024-02-05)
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation (2023-12-21)
PEEKABOO: Interactive Video Generation via Masked-Diffusion (2023-12-12)
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation (2023-12-06)
Fine-grained Controllable Video Generation via Object Appearance and Context (2023-12-05)

Camera Motion Navigation

FullDiT: Multi-Task Video Generative Foundation Model with Full Attention (2025-03-25)
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video (2025-03-14)
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation (2025-02-12)
AKiRa: Augmentation Kit on Rays for optical video generation (2024-12-18)
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints (2024-12-10)
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers (2024-11-27)
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning (2024-11-07)
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention (2024-10-14)
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control (2024-07-17)
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation (2024-06-04)
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control (2024-05-27)
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis (2024-05-23)
CameraCtrl: Enabling Camera Control for Text-to-Video Generation (2024-04-02)

Instruction Navigation

Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback (2024-12-03)

Action Navigation

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (2024-01-18)

1.3.2 Conditioned World Model in Robotics

Action Navigation

iVideoGPT: Interactive VideoGPTs are Scalable World Models (2024-05-24)
Model-Based Reinforcement Learning for Atari (2019-03-01)

Instruction Navigation

Learning Universal Policies via Text-Guided Video Generation (2023-01-31)

Goal Navigation

Robot Motion Planning as Video Prediction: A Spatio-Temporal Neural Network-based Motion Planner (2022-12-26)

Hybrid Navigation

RoboDreamer: Learning Compositional World Models for Robot Imagination (2024-04-18)

1.3.3 Conditioned World Model in Autonomous Driving

Layout Condition

Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention (2024-12-04)
MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control (2024-09-10)
DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation (2024-09-09)
Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation (2024-06-03)
SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control (2024-03-28)
Panacea: Panoramic and Controllable Video Generation for Autonomous Driving (2023-11-28)
MagicDrive: Street View Generation with Diverse 3D Geometry Control (2023-10-04)

Instruction Navigation

ADriver-I: A General World Model for Autonomous Driving (2023-11-22)

Action Navigation

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation (2023-12-05)
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving (2023-11-29)
DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving (2023-09-18)
Model-Based Imitation Learning for Urban Driving (2022-10-14)
Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model (2022-10-08)
Iso-Dream: Isolating and Leveraging Noncontrollable Visual Dynamics in World Models (2022-05-27)
DriveGAN: Towards a Controllable High-Quality Neural Simulation (2021-04-30)
Learning a Driving Simulator (2016-08-03)

Hybrid Navigation

GenAD: Generalized Predictive Model for Autonomous Driving (2024-03-14)

Other Navigation

Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception (2025-03-17)

1.3.4 Conditioned World Model in Gaming

Controller Navigation

Video Game Generation: A Practical Study using Mario (2024)
Playable Video Generation (2021-01-28)
Learning to Simulate Dynamic Environments with GameGAN (2020-05-25)

Action Navigation

Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models (2023-03-23)
Transformer-based World Models Are Happy With 100k Interactions (2023-03-13)
Mastering Diverse Domains through World Models (2023-01-10)
Learning General World Models in a Handful of Reward-Free Deployments (2022-10-23)
Transformers are Sample-Efficient World Models (2022-09-05)
Playable Environments: Video Manipulation in Space and Time (2022-03-03)
Mastering Atari with Discrete World Models (2020-10-05)
Dream to Control: Learning Behaviors by Latent Imagination (2019-12-03)
Recurrent Environment Simulators (2017-04-07)

Hybrid Navigation

2. Generation 2: Interactiveness - Controllability and Interactive Dynamics

2.1 High-quality World Foundation Model

Date	Venue	Acronym	Paper
2026-03-04	Arxiv	Helios	Helios: Real Real-Time Long Video Generation Model
2026-02-02	Arxiv	Causal Forcing	Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
2026-01-24	Arxiv	SkyReels-V3	SkyReels-V3 Technique Report
2025-12-23	Arxiv	SemanticGen	SemanticGen: Video Generation in Semantic Space
2025-12-18	Arxiv	Kling-Omni	Kling-Omni Technical Report
2025-12-16	Arxiv	MemFlow	MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
2025-06-18		Hailuo 02
2025-06-10	Arxiv	Seedance 1.0	Seedance 1.0: Exploring the Boundaries of Video Generation Models
2025-06-09	Arxiv	Self Forcing	Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
2025-05-19	Arxiv	MAGI-1	MAGI-1: Autoregressive Video Generation at Scale
2025-05		Veo 3	Veo 3: AI Video Generation with Realistic Sound
2025-04-17	Arxiv	SkyReels-V2	SkyReels-V2: Infinite-length Film Generative Model
2025-04-07		Nova Reel
2025-03-31		Gen-4
2025-03-26	Arxiv	Wan 2.1	Wan: Open and Advanced Large-Scale Video Generative Models
2025-03-13		Step-Video-T2V
2025-03-12	Arxiv	Open-Sora2.0	Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
2025-02-11	Arxiv	Magic 1-For-1	Magic 1-For-1: Generating One Minute Video Clips within One Minute
2025-01-21		MiracleVision
2025-01-15	Arxiv	RepVideo	RepVideo: Rethinking Cross-Layer Representation for Video Generation
2025-01-14	Arxiv	Vchitect-2.0	Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
2025-01-07	Arxiv	Cosmos	Cosmos World Foundation Model Platform for Physical AI
2024-12-29	Arxiv	Open-Sora	Open-Sora: Democratizing Efficient Video Production for All
2024-12-10	CVPR25	CausVid	From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
2024-12-03	Arxiv	HunyuanVideo	HunyuanVideo: A Systematic Framework For Large Video Generative Models
2024-11-28	Arxiv	Open-Sora Plan	Open-Sora Plan: Open-Source Large Video Generation Model
2024-10-22		Mochi-1
2024-08-12	ICLR25	CogVideoX	CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer
2024-07-08	Arxiv	Mira	MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
2024-06-17		Gen-3
2024-06-13		Luma
2024-06-06		Kling
2024-05-29	Arxiv	EasyAnimate	EasyAnimate: A High-Performance Long Video Generation Method Based on Transformer Architecture
2024-05-09		Jimeng
2024-05-07	Arxiv	Vidu	Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
2024-02-15		Sora	Video generation models as world simulators

World Model Regulation Methods

Stabilization:

StableWorld: Towards Stable and Consistent Long Interactive Video Generation (2026-02-02)

Inference-time Physics Alignment:

Inference-time Physics Alignment of Video Generative Models with Latent World Models (2026-01-15)

Efficiency:

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention (2026-01-21)

Planning Optimization:

Parallel Stochastic Gradient-Based Planning for World Models (2026-01-31)

Long Video Generation Methods

LIVE: Long-horizon Interactive Video World Modeling (2026-02-03)

2.2 Video Generation as World Model in General Scenes

2.2.1 Geometry Condition Prior World Model

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation (2025-12-08)
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework (2025-12-02)
VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning (2025-10-29)
SketchVideo: Sketch-based Video Generation and Editing (2025-03-30)

2.2.2 3D Condition Prior World Model

2.2.3 Physical Prior World Model

Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals (2026-01-09)
PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning (2025-10-15)
WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation (2025-03-11)
Synthetic Video Enhances Physical Fidelity in Video Synthesis (2025-03-26)
PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop (2025-03-12)
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation (2024-11-30)

2.2.4 Audio Driven World Model

InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing (2025-08-19)
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency (2024-09-04)

2.2.5 Trajectory Navigation World Model

2.2.6 Camera Motion Navigation World Model

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control (2026-01-08)
AirScape: An Aerial Generative World Model with Motion Controllability (2025-07-10)
CamCloneMaster: Enabling Reference-based Camera Control for Video Generation (2025-06-03)
Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval (2025-06-03)
TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation (2025-04-11)
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models (2025-03-13)
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control (2025-03-05)
Training-free Camera Control for Video Generation (2024-06-14)

2.2.7 Instruction Navigation World Model

UniVideo: Unified Understanding, Generation, and Editing for Videos (2025-10-09)
Video World Models with Long-term Spatial Memory (2025-06-05)
SlowFast-VGen: Slow-Fast Learning for Action-driven Long Video Generation (2024-10-30)
Pandora: Towards general world model with natural language actions and video states (2024-06-12)

2.2.8 Action Navigation World Model

Grounding World Simulation Models in a Real-World Metropolis (2026-03-16)
WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions (2025-05-23)
Introducing Multiverse: The First AI Multiplayer World Model (2025-05-08)
Aether: Geometric-Aware Unified World Modeling (2025-03-24)

2.3 Video Generation as World Model in Robotics

2.3.1 Action Navigation World Model

Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation (2026-03-17)
Interactive World Simulator for Robot Policy Training and Evaluation (2026-03-09)
BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks (2026-02-03)
Walk through Paintings: Egocentric World Models from Internet Priors (2026-01-21)
Aerial World Model for Long-horizon Visual Generation and Navigation in 3D Space (2025-12-26)
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling (2025-07-07)
WorldVLA: Towards Autoregressive Action World Model (2025-06-26)
Consistent World Models via Foresight Diffusion (2025-05-22)
Learning 3D Persistent Embodied World Models (2025-05-05)
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control (2025-03-18)
Unified Video Action Model (2025-02-28)
Pre-Trained Video Generative Models as World Simulators (2025-02-10)
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression (2025-02-06)
Prediction with Action: Visual Policy Learning via Joint Denoising Process (2024-11-27)
IRASim: Learning Interactive Real-Robot Action Simulators (2024-06-20)

2.3.2 Instruction Navigation World Model

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning (2026-01-22)
ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models (2026-01-18)
InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation (2026-01-05)
VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation (2024-11-14)
EVA: An Embodied World Model for Future Video Anticipation (2024-10-20)
VideoAgent: Self-Improving Video Generation (2024-10-14)
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation (2024-04-16)
Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training (2024-02-22)
Video Language Planning (2023-10-16)
Learning Interactive Real-World Simulators (2023-10-09)
Compositional Foundation Models for Hierarchical Planning (2023-09-15)

2.3.3 Goal Navigation World Model

Grounding Video Models to Actions through Goal Conditioned Exploration (2024-11-11)
Learning to Act from Actionless Videos through Dense Correspondences (2023-10-12)

2.3.4 Hybrid Navigation World Model

Visuo-Tactile World Models (2026-02-05)
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge (2025-07-06)
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control (2025-06-02)
NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants (2025-02-19)
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (2023-12-20)

2.3.5 Real-time Interactive World Model

Evaluating Robot Policies in a World Model (2025-05-31)
Learning Interactive Real-World Simulators (2023-10-09)

2.4 Video Generation as World Model in Autonomous Driving

2.4.1 Layout Prior World Model

DriveFix: Spatio-Temporally Coherent Driving Scene Restoration (2026-03-17)
ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask (2026-02-03)
UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving (2026-02-02)
Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models (2025-06-10)
CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving (2025-03-28)
MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving (2025-03-20)
UniScene: Unified Occupancy-centric Driving Scene Generation (2024-12-06)
DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes (2024-09-06)
DiVE: DiT-based Video Generation with Enhanced Control (2024-09-03)
DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model (2023-10-11)

2.4.2 Instruction Navigation World Model

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation (2024-03-11)

2.4.3 Trajectory Navigation World Model

InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation (2026-02-03)
UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning (2026-02-02)
DISK: Dynamic Inference SKipping for World Models (2026-01-31)
MAD: Motion Appearance Decoupling for efficient Driving World Models (2026-01-14)
Epona: Autoregressive Diffusion World Model for Autonomous Driving (2025-06-30)
Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space (2025-03-12)
DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT (2024-12-27)
GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control (2024-12-15)
Doe-1: Closed-Loop Autonomous Driving with Large World Model (2024-12-12)
ACT-BENCH: Towards Action Controllable World Models for Autonomous Driving (2024-12-06)
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control (2024-11-21)
MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes (2024-05-23)

2.4.4 Action Navigation World Model

Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving (2026-01-29)
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers (2024-12-24)
InfinityDrive: Breaking Time Limits in Driving World Models (2024-12-02)
GAIA-1: A Generative World Model for Autonomous Driving (2023-09-29)

2.4.5 Hybrid Navigation World Model

2.4.6 Other Navigation World Model

Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models (2025-07-17)
Physical Informed Driving World Model (2024-12-11)

2.5 Video Generation as World Model in Gaming

2.5.1 Controller Navigation World Model

WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation (2026-03-17)
Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory (2026-02-03)
The World's First AI-Native UGC Game Engine Powered by Real-Time World Model (2025-07-03)
Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition (2025-06-20)
WORLDMEM: Long-term Consistent World Simulation with Memory (2025-04-16)
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft (2025-04-11)
Model as a Game: On Numerical and Spatial Consistency for Generative Games (2025-03-27)
World and Human Action Models towards gameplay ideation (2025-02-19)
GameFactory: Creating New Games with Generative Interactive Videos (2025-01-14)
Genie 2: A large-scale foundation world model (2024-12-04)
The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control (2024-12-04)
Playable Game Generation (2024-12-01)
GameGen-X: Interactive Open-world Game Video Generation (2024-11-01)
Oasis: A Universe in a Transformer (2024-10-31)
Diffusion Models Are Real-Time Game Engines (2024-08-27)
Diffusion for World Modeling: Visual Details Matter in Atari (2024-05-20)
Genie: Generative Interactive Environments (2024-02-23)

2.5.2 Action Navigation World Model

Accurate and Efficient World Modeling with Masked Latent Transformers (2025-07-05)
Long-Context State-Space Video World Models (2025-05-26)
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (2024-11-07)

2.5.3 Hybrid Navigation World Model

AdaWorld: Learning Adaptable World Models with Latent Actions (2025-03-24)

3. Generation 3: Planning - Modeling the Future Evolution of Complex Systems

Date	Venue	Acronym	Paper
2026-01-28	Arxiv	LingBot-World	Advancing Open-source World Models
2025-12-26	Arxiv	Yume1.5	Yume1.5: A Text-Controlled Interactive World Generation Model
2025-12-16	Arxiv	WorldPlay	WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
2025-12-09	ICLR26	Astra	Astra: General Interactive World Model with Autoregressive Denoising
2025-07-17		MirageLSD	MirageLSD: The First Live-Stream Diffusion AI Video Model
2025-06-11	Arxiv	V-JEPA 2	V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
2024-12-04	CVPR25	NWM	Navigation World Models

For Robotics:

Causal World Modeling for Robot Control (2026-01-29)
An Efficient and Multi-Modal Navigation System with One-Step World Model (2026-01-18)

Note: Action and goal navigation for robotics.

4. Generation 4: Counterfactual and Outlier Modeling

4.1 Macroscopic Scale World Model

4.2 Mesoscopic Scale World Model

4.3 Microscopic Scale World Model

WonderZoom: Multi-Scale 3D World Generation (2025-12-09)

5. Evaluation and Datasets

5.1 Evaluation Metrics of Video Generation

Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding (2025-07-20)
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness (2025-03-27)
FullDiT: Multi-Task Video Generative Foundation Model with Full Attention (2025-03-25)
VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation (2025-03-09)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models (2024-11-20)
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (2024-07-19)
VideoPhy: Evaluating Physical Commonsense for Video Generation (2024-06-05)
VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation (2023-11-03)
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models (2023-10-17)

5.2 Evaluation Metrics of World Model

WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models (2026-02-09)
WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models (2026-01-29)
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models (2026-01-27)
PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models (2026-01-22)
DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving (2026-01-04)
Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test (2026-01-07)
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation (2025-06-27)
Evaluating Robot Policies in a World Model (2025-05-31)
WorldScore: A Unified Evaluation Benchmark for World Generation (2025-04-01)
WorldModelBench: Judging Video Generation Models As World Models (2025-02-28)
WorldSimBench: Towards Video Generation Models as World Simulators (2024-10-23)
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation (2024-10-07)

5.3 Datasets

Action100M: A Large-scale Video Action Dataset (2026-01-15)
VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing (2024-11-22)
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content (2024-10-10)
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation (2024-08-05)
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (2024-07-08)
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2024-02-29)
Consistent Video-to-Video Transfer Using Synthetic Dataset (2023-11-01)
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions (2021-11-19)
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (2021-04-01)
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (2019-06-07)
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (2019-04-06)
Localizing Moments in Video with Natural Language (2017-08-04)
Towards Automatic Learning of Procedures from Web Instructional Videos (2017-03-28)
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (2016-12-12)
ActivityNet: A large-scale video benchmark for human activity understanding (2015-10-15)
A Dataset for Movie Description (2015-01-12)

6. Study and Rethinking

6.1 Survey

From Perception to Action: Spatial AI Agents and World Models (2026-02-02)
A Mechanistic View on Video Generation as World Models: State and Dynamics (2026-01-22)
From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models (2026-01-21)
Video Generation Models in Robotics - Applications, Research Challenges, Future Directions (2026-01-12)
Modeling the Mental World for Embodied AI: A Comprehensive Review (2025-12-17)
3D and 4D World Modeling: A Survey (2025-09-04)
Reconstructing 4D Spatial Intelligence: A Survey (2025-07-28)
A Survey: Learning Embodied Intelligence from Physical Simulators and World Models (2025-07-01)
A Survey of Interactive Generative Video (2025-04-30)
Survey of Video Diffusion Models: Foundations, Implementations, and Applications (2025-04-22)
Exploring the Evolution of Physics Cognition in Video Generation: A Survey (2025-03-27)
Generative Physical AI in Vision: A Survey (2025-01-19)
Joint Perception and Prediction for Autonomous Driving: A Survey (2024-12-18)
Understanding World or Predicting Future? A Comprehensive Survey of World Models (2024-11-21)
Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey (2024-11-05)
From Sora What We Can See: A Survey of Text-to-Video Generation (2024-05-17)
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (2024-05-06)
Video Diffusion Models: A Survey (2024-05-06)
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation (2024-03-08)
World Models for Autonomous Driving: An Initial Survey (2024-03-05)
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (2024-02-27)
A Survey on Video Diffusion Models (2023-10-16)
Video Generative Adversarial Networks: A Review (2020-11-04)

6.2 Position & Perspective

Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks (2026-02-02)
An Empirical Study of World Model Quantization (2026-02-02)
Learning Latent Action World Models In The Wild (2026-01-08)
General agents need world models (2025-06-02)
Position: Interactive Generative Video as Next-Generation Game Engine (2025-03-21)
How Far is Video Generation from World Model: A Physical Law Perspective (2024-11-04)
Video as the New Language for Real-World Decision Making (2024-02-27)
A Path Towards Autonomous Machine Intelligence (2022-06-07)

7. Downstream Tasks for World Modeling

7.1 World Models as Data Generators

7.2 World Models as Reasoning Proxy

DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models (2026-03-17)
NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation (2026-03-16)
World-Gymnast: Training Robots with Reinforcement Learning in a World Model (2026-02-02)
World Models as an Intermediary between Agents and the Real World (2026-01-31)
TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion (2026-01-26)
Mirage2Matter: A Physically Grounded Gaussian World Model from Video (2026-01-24)

8. World Models for Other Applications

8.1 World Models for Medicine

EHRWorld: A Patient-Centric Medical World Model for Long-Horizon Clinical Trajectories (2026-02-03)

Citation

If you find this paper useful, please consider citing:

@article{yue2025video,
  title={Simulating the World Model with Artificial Intelligence: A Roadmap},
  author={Jingtong Yue and Ziqi Huang and Zhaoxi Chen and Xintao Wang and Pengfei Wan and Ziwei Liu},
  journal={arXiv preprint arXiv:2511.08585},
  year={2025}
}
