🧩 Abstract 🧩
Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal preservation of identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates two components into a video DiT to enable multi-subject video generation: a hierarchical identity-preserving attention mechanism and a VLM that reasons over the multimodal input. An online RL stage further refines concept alignment.
🔮 Method 🔮
Architecture
ID-Crafter generates multi-subject videos from a text prompt and multiple reference images using three components: (1) Hierarchical Identity-Preserving Attention, which aggregates features both within and across subjects and modalities, ensuring identity consistency and faithful textual alignment; (2) Semantic Guidance via Pretrained Vision-Language Models (VLMs), which leverages the VLMs' rich semantic understanding to capture fine-grained interactions among multiple subjects and modalities; and (3) an online reinforcement learning phase, which further enhances video quality and preserves subject identities across time. Extensive experiments show that ID-Crafter outperforms previous methods in identity preservation, temporal consistency, and overall video quality.
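To make the idea of aggregating features "within and across subjects and modalities" more concrete, here is a minimal NumPy sketch of one way such a hierarchy could be arranged. It is not the ID-Crafter implementation: the three-stage ordering, the residual connections, the single-head attention without learned projections, and every name (`hierarchical_id_attention`, `subject_tokens`, and so on) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention; q is (Nq, d), k and v are (Nk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def hierarchical_id_attention(video_tokens, subject_tokens, text_tokens):
    """Hypothetical three-stage aggregation (an assumption, not the paper's design).

    video_tokens:   (Nv, d) array of video latent tokens
    subject_tokens: list of (Ni, d) arrays, one per reference subject
    text_tokens:    (Nt, d) array of prompt tokens
    """
    # Stage 1 (within each subject): reference tokens attend only to themselves.
    refined = [s + attention(s, s, s) for s in subject_tokens]

    # Stage 2 (across subjects and modalities): each subject attends to
    # all subjects together with the text tokens.
    context = np.concatenate(refined + [text_tokens], axis=0)
    fused = [s + attention(s, context, context) for s in refined]

    # Stage 3: video tokens read from the fused identity tokens and the text.
    memory = np.concatenate(fused + [text_tokens], axis=0)
    return video_tokens + attention(video_tokens, memory, memory)


# Toy usage with random features (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(16, d))
subjects = [rng.normal(size=(4, d)), rng.normal(size=(6, d))]
text = rng.normal(size=(5, d))
out = hierarchical_id_attention(video, subjects, text)
```

In this sketch the output keeps the shape of `video_tokens`, so the block could in principle be dropped between existing layers of a token-based backbone. A real model would add learned query, key, and value projections, multiple heads, and normalization.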
🎬 Results 🎬
Multi-Reference Subject-to-Video Generation
ID-Crafter can generate natural, dynamic details across multiple subjects from the given text, such as fabric wrinkles induced by movement, wind-swept hair, and crumbs scattered from torn bread.
ID-preserving Generation
ID-Crafter faithfully preserves the identity of reference subjects (including people and items) while producing vivid videos aligned with the given prompt.
📝 Comparison 📝
📝 Benchmark 📝
📊 Data Curation 📊
📄 BibTeX 📄
@misc{pan2025idcraftervlmgroundedonlinerl,
title={ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation},
author={Panwang Pan and Jingjing Zhao and Yuchen Lin and Chenguo Lin and Chenxin Li and Hengyu Liu and Tingting Shen and Yadong MU},
year={2025},
eprint={2511.00511},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.00511},
}