Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset CVPR 2025
- Yiqun Mei 1,2
- Mingming He 1
- Li Ma 1
- Julien Philip 1
- Wenqi Xian 1
- David M George 1
- Xueming Yu 1
- Gabriel Dedic 1
- Ahmet Levent Taşel 1
- Ning Yu 1
- Vishal M. Patel 2
- Paul Debevec 1
- 1 Netflix Eyeline Studios
- 2 Johns Hopkins University
Lux Post Facto offers portrait relighting as a simple post-production process. Users can edit the lighting of portrait images (first row) and videos (second row) with high fidelity using any HDR map. Our method is temporally stable and highly photorealistic.
Abstract
Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections, as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT) captures. In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. On the model side, we design a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, together with a new lighting injection mechanism that enables precise control. In this way, we leverage its strong spatial and temporal generative capability to produce plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static-expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling, avoiding the need to acquire paired video data under different lighting conditions. Extensive experiments show that our model produces state-of-the-art results in terms of both photorealism and temporal consistency.
Method
To relight an input video, a delighting model predicts an albedo video (a), which is then relit by a relighting model (b). Both models share the same architecture (c), based on Stable Video Diffusion (SVD) [3]. We condition the SVD on the input video by concatenating the input latents to the Gaussian noise. To support autoregressive prediction for long sequences, we replace the first T frames with previous predictions, indicated by a binary mask concatenated to the input. The output lighting is controlled by an HDR map, which is converted to a light embedding and fed to the U-Net through cross-attention layers. The VAE that encodes and decodes the latents is omitted for clarity.
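The sketch below illustrates how such a conditional input could be assembled: input-video latents and a binary mask are concatenated to the (partially replaced) noisy latents along the channel dimension, and the HDR map is encoded into lighting tokens for cross-attention. This is a minimal PyTorch-style sketch under our own assumptions; the module names (`LightEncoder`, `build_unet_input`), tensor shapes, and the token count are hypothetical and are not the authors' released implementation.

```python
import torch
import torch.nn as nn


class LightEncoder(nn.Module):
    """Hypothetical encoder mapping an HDR environment map to lighting tokens
    that a video diffusion U-Net can attend to via cross-attention."""

    def __init__(self, token_dim=1024, num_tokens=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(128 * 16, num_tokens * token_dim)
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, hdr_map):                      # (B, 3, H, W), log-encoded HDR
        feat = self.backbone(hdr_map).flatten(1)     # (B, 128 * 16)
        tokens = self.proj(feat)                     # (B, num_tokens * token_dim)
        return tokens.view(-1, self.num_tokens, self.token_dim)


def build_unet_input(noisy_latents, input_latents, prev_latents=None, overlap_t=2):
    """Assemble the per-frame U-Net input described above (shapes are assumptions).

    noisy_latents: (B, F, C, h, w) Gaussian-noised target latents.
    input_latents: (B, F, C, h, w) VAE latents of the input (or albedo) video.
    prev_latents:  optional (B, overlap_t, C, h, w) latents predicted for the
                   previous chunk; the first `overlap_t` frames are replaced with
                   them so long sequences can be generated autoregressively.
    """
    B, F, C, h, w = noisy_latents.shape
    mask = torch.zeros(B, F, 1, h, w, device=noisy_latents.device)
    if prev_latents is not None:
        noisy_latents = noisy_latents.clone()
        noisy_latents[:, :overlap_t] = prev_latents  # carry over previous predictions
        mask[:, :overlap_t] = 1.0                    # flag which frames are "given"
    # Channel-wise concat: [noisy / carried-over latents | input latents | binary mask]
    return torch.cat([noisy_latents, input_latents, mask], dim=2)
```

A denoising step would then pass this tensor to the video U-Net together with the lighting tokens, e.g. `unet(build_unet_input(...), timesteps, encoder_hidden_states=light_encoder(hdr_map))` in a diffusers-style interface; the exact call signature depends on the SVD variant used and is likewise an assumption here.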
BibTeX
@inproceedings{mei2025lux,
title={Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset},
author={Mei, Yiqun and He, Mingming and Ma, Li and Philip, Julien and Xian, Wenqi and George, David M and Yu, Xueming and Dedic, Gabriel and Taşel, Ahmet Levent and Yu, Ning and Patel, Vishal M and Debevec, Paul},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}