ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Jianhong Bai¹, Menghan Xia², Xiao Fu³, Xintao Wang², Lianrui Mu¹, Jinwen Cao²,
Zuozhu Liu¹, Haoji Hu¹, Xiang Bai⁴, Pengfei Wan², Di Zhang²

¹Zhejiang University, ²Kling Team, Kuaishou Technology, ³CUHK, ⁴HUST

TL;DR: We propose ReCamMaster to re-capture in-the-wild videos with novel camera trajectories, accomplished through our proposed simple and effective video conditioning scheme. We illustrate the applicability of our model across various domains, including filmmaking, 4D reconstruction, video stabilization, autonomous driving, and embodied AI. We also release a multi-camera synchronized video dataset rendered with Unreal Engine 5.

Demos

Arc Trajectories

Source Videos Synthesized Videos Source Videos Synthesized Videos

Translation Up Trajectories

Source Videos Synthesized Videos Source Videos Synthesized Videos

Translation Down Trajectories

Source Videos Synthesized Videos Source Videos Synthesized Videos

Pan Trajectories

Source Videos Synthesized Videos Source Videos Synthesized Videos

Tilt Trajectories

Source Videos Synthesized Videos Source Videos Synthesized Videos

Zoom in / Zoom out Trajectories

Source Videos Synthesized Videos Source Videos Synthesized Videos

More Complex Trajectories

Source Videos Synthesized Videos Source Videos Synthesized Videos

Application in 4D Reconstruction

Source Videos Synthesized Videos Source Videos Synthesized Videos

Application in Video Stabilization

Handheld footage, especially from amateur videographers, is often shaky. Video stabilization smooths the camera motion to produce easy-to-watch video, and ReCamMaster can do this when given a smooth camera trajectory as input. To verify this, we fed unsteady videos from the DeepStab dataset (unsteady footage captured with handheld hardware) to the model and obtained stable videos as output. The model stabilizes the video while preserving the scenes and actions of the original.
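As a rough illustration of what a "smooth camera trajectory" could look like, the sketch below applies a centered moving average to per-frame camera positions. Everything here is an assumption made for illustration: the trajectory representation (a list of 3D positions, rotations omitted), the function names `smooth_trajectory` and `jitter`, and the choice of a moving average. The page does not describe how the smooth trajectories fed to ReCamMaster are produced.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def smooth_trajectory(positions: List[Vec3], radius: int = 2) -> List[Vec3]:
    """Centered moving average over camera positions.

    The window covers up to `radius` frames on each side and is clipped
    at the start and end of the sequence. (Hypothetical helper.)
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    n = len(positions)
    smoothed: List[Vec3] = []
    for i in range(n):
        window = positions[max(0, i - radius):min(n, i + radius + 1)]
        count = len(window)
        smoothed.append(
            (
                sum(p[0] for p in window) / count,
                sum(p[1] for p in window) / count,
                sum(p[2] for p in window) / count,
            )
        )
    return smoothed


def jitter(positions: List[Vec3]) -> float:
    """Sum of absolute second differences, a simple shakiness measure."""
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        for axis in range(3):
            total += abs(nxt[axis] - 2.0 * cur[axis] + prev[axis])
    return total


# Usage: a forward-moving camera with alternating side-to-side shake.
shaky = [(float(i), 0.5 if i % 2 == 0 else -0.5, 0.0) for i in range(10)]
stable = smooth_trajectory(shaky, radius=2)
```

A larger `radius` removes more shake but also flattens intentional camera motion, so in practice the window size would need tuning per clip.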

Source Videos Synthesized Videos Source Videos Synthesized Videos

Application in Embodied AI

In the realm of Embodied AI, creating large-scale, high-quality datasets like Bridge or AGIBOT can be extremely expensive. ReCamMaster offers a solution by altering video perspectives, making it an effective tool for data augmentation. It could supply robots with multi-perspective observation data, thereby enhancing the performance of downstream tasks.

Source Videos Synthesized Videos Source Videos Synthesized Videos

Application in Autonomous Driving

We have discovered that ReCamMaster demonstrates promising generalization capabilities in autonomous driving scenarios. Consequently, it can serve as an effective data augmentation tool in autonomous driving.

Source Videos Synthesized Videos Source Videos Synthesized Videos

Abstract

Camera control has been actively studied in text- or image-conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. This is non-trivial because it induces extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a thoroughly explored video conditioning mechanism. Considering the scarcity of qualified training data, we constructed a large-scale multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the trained models generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset will be publicly available.

Method

To re-shoot a source video with novel camera trajectories, we propose to harness the generative capability of pre-trained text-to-video diffusion models by imposing dual conditions, i.e., the source video and the target camera trajectory, through a meticulously designed framework. An overview of the model is depicted below.

Left: The training pipeline of ReCamMaster. A latent diffusion model is optimized to reconstruct the target video V_t, conditioned on the source video V_s, target camera pose cam_t, and target prompt p_t. Right: Comparison of different video conditioning techniques. (a) Frame-dimension conditioning used in our paper; (b) Channel-dimension conditioning used in baseline methods [1], [2]; (c) View-dimension conditioning in [3].
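One way to read the training pipeline in the caption is as a conditional denoising objective. The form below is a generic noise-prediction loss, written here as an assumption for illustration only; the page does not state the exact loss used. The symbols z_s and z_t (latents of V_s and V_t), τ (diffusion step, chosen to avoid clashing with the subscript t for "target"), ε, and ε_θ are introduced here and do not come from the source.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{z_t,\ \epsilon \sim \mathcal{N}(0, I),\ \tau}
    \left[
      \left\lVert
        \epsilon - \epsilon_\theta\!\left( z_t^{(\tau)};\ z_s,\ \mathrm{cam}_t,\ p_t,\ \tau \right)
      \right\rVert_2^2
    \right]
```

Here z_t^{(τ)} denotes the target latent after noising to step τ, and the three conditions (z_s, cam_t, p_t) match the conditioning inputs named in the caption.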

Comparisons

We compare the proposed ReCamMaster with state-of-the-art camera-controlled video-to-video generation methods including GCD [1], Trajectory-Attention [4], and DaS [5].

Ablation on Video Conditioning Mechanisms

In our paper, we propose a novel video conditioning scheme that concatenates the tokens of the source video with the target video tokens along the frame dimension. To verify its effectiveness, we compared it against the "channel concatenation" technique used in baseline methods [1], [2] and the "view concatenation" technique from [3]. The frame-dimension conditioning scheme yields clearly better results than both alternatives.
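The three conditioning schemes differ mainly in which axis the source tokens are attached to. The sketch below shows only the resulting shapes, using nested Python lists as a stand-in for token tensors of shape [frames, tokens per frame, channels]. The helper names, the toy sizes, and the source-then-target ordering are assumptions for illustration; the actual model operates on latent tokens inside a diffusion transformer and is not reproduced here.

```python
from typing import List

Token = List[float]   # one token: a feature vector with c channels
Frame = List[Token]   # one frame: s spatial tokens
Video = List[Frame]   # one video: f frames


def make_video(f: int, s: int, c: int, fill: float) -> Video:
    """Build a toy [f, s, c] token grid filled with a constant."""
    return [[[fill] * c for _ in range(s)] for _ in range(f)]


def shape(x) -> tuple:
    """Shape of a non-empty, regular nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


def frame_concat(source: Video, target: Video) -> Video:
    """(a) Frame dimension: [f, s, c] and [f, s, c] -> [2f, s, c]."""
    return source + target


def channel_concat(source: Video, target: Video) -> Video:
    """(b) Channel dimension: [f, s, c] and [f, s, c] -> [f, s, 2c]."""
    return [
        [tok_s + tok_t for tok_s, tok_t in zip(frame_s, frame_t)]
        for frame_s, frame_t in zip(source, target)
    ]


def view_concat(source: Video, target: Video) -> List[Video]:
    """(c) View dimension: stack on a new leading axis -> [2, f, s, c]."""
    return [source, target]


# Usage with toy sizes: 4 frames, 6 tokens per frame, 8 channels.
src = make_video(4, 6, 8, fill=0.0)
tgt = make_video(4, 6, 8, fill=1.0)
print(shape(frame_concat(src, tgt)))    # (8, 6, 8)
print(shape(channel_concat(src, tgt)))  # (4, 6, 16)
print(shape(view_concat(src, tgt)))     # (2, 4, 6, 8)
```

With frame concatenation the token width stays at c and the sequence doubles in length, so source and target tokens remain separate entries in the same sequence. Channel concatenation instead fuses each source token with the target token at the same position, which keeps the sequence length unchanged but doubles the token width.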

References:
[1] Van Hoorick, Basile, et al. "Generative camera dolly: Extreme monocular dynamic novel view synthesis." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
[2] Bian, Weikang, et al. "GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking." arXiv preprint arXiv:2501.02690 (2025).
[3] Bai, Jianhong, et al. "SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints." arXiv preprint arXiv:2412.07760 (2024).
[4] Xiao, Zeqi, et al. "Trajectory Attention for Fine-Grained Video Motion Control." The Thirteenth International Conference on Learning Representations, 2025.
[5] Gu, Zekai, et al. "Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control." arXiv preprint arXiv:2501.03847 (2025).

Acknowledgments:
We thank Jinwen Cao, Yisong Guo, Haowen Ji, Jichao Wang, and Yi Wang from Kuaishou Technology for their invaluable help in constructing the training dataset.