Camera control has been actively studied in text- or image-conditioned video generation tasks. However, altering the camera trajectory of a given video remains under-explored, despite its importance in video creation. The task is non-trivial because it imposes the additional constraints of keeping appearance consistent across frames and keeping the dynamics synchronized with the source video. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video along novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a thoroughly explored video conditioning mechanism. Because suitable training data is scarce, we construct a large-scale multi-camera synchronized video dataset using Unreal Engine 5. The dataset is carefully curated to follow real-world filming characteristics and covers diverse scenes and camera movements, which helps trained models generalize to in-the-wild videos. Lastly, we further improve robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset will be made publicly available.
To re-shoot a source video along novel camera trajectories, we propose to harness the generative capability of pre-trained text-to-video diffusion models by imposing two conditions, namely the source video and the target camera trajectory, through a meticulously designed framework. An overview of the model is depicted below.
Left: The training pipeline of ReCamMaster. A latent diffusion model is optimized to reconstruct the target video V_t, conditioned on the source video V_s, target camera pose cam_t, and target prompt p_t. Right: Comparison of different video condition techniques. (a) Frame-dimension conditioning used in our paper; (b) Channel-dimension conditioning used in baseline methods [1], [2]; (c) View-dimension conditioning in [3].
In our paper, we propose a novel video conditioning scheme that concatenates the tokens of the source video with the tokens of the target video along the frame dimension. To verify its effectiveness, we compared it against the "channel concatenation" technique used in baseline methods [1], [2] and the "view concatenation" technique from [3]. In this comparison, the proposed frame-dimension conditioning significantly improves the model's performance over both alternatives.
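To make the difference between schemes (a) and (b) concrete, the following is a minimal NumPy sketch of how the two concatenations change the token sequence fed to the diffusion model. It is an illustration under stated assumptions, not the ReCamMaster implementation: the latent sizes, the patch size, the `patchify` helper, and the [target, source] ordering are all hypothetical choices made for this example. View-dimension conditioning (c) is omitted.

```python
import numpy as np

# Hypothetical latent sizes, chosen only for illustration.
B, F, C, H, W = 1, 4, 16, 8, 8   # batch, frames, channels, height, width
P = 2                            # hypothetical spatial patch size

rng = np.random.default_rng(0)
source = rng.standard_normal((B, F, C, H, W))  # stands in for the latent of V_s
target = rng.standard_normal((B, F, C, H, W))  # stands in for the noisy latent of V_t


def patchify(x, p):
    """Split every frame into p x p patches and flatten them into one token sequence.

    Hypothetical helper: returns an array of shape (batch, tokens, token_dim).
    """
    b, f, c, h, w = x.shape
    x = x.reshape(b, f, c, h // p, p, w // p, p)
    x = x.transpose(0, 1, 3, 5, 2, 4, 6)  # (b, f, h/p, w/p, c, p, p)
    return x.reshape(b, f * (h // p) * (w // p), c * p * p)


# (a) Frame-dimension conditioning: the source frames are appended to the
# target frames, so the sequence length doubles and the token width is unchanged.
frame_tokens = patchify(np.concatenate([target, source], axis=1), P)

# (b) Channel-dimension conditioning: source and target are stacked per pixel,
# so the sequence length is unchanged and the token width doubles.
channel_tokens = patchify(np.concatenate([target, source], axis=2), P)

print(frame_tokens.shape)    # (1, 128, 64)
print(channel_tokens.shape)  # (1, 64, 128)
```

In this sketch, frame-dimension conditioning keeps source and target tokens as separate entries of the same sequence, so any attention layer that operates over that sequence can relate them freely. Channel-dimension conditioning instead fuses the two videos at each spatial location before any attention is applied.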