We recommend viewing all images and videos in full screen. Click on an image or video to see it at full scale.
We provide another ablation for our Green-Screen Loss. As discussed in Sec. 3.1, a nice property of this loss is that it allows intuitive supervision of a desired effect. For example, when generating semi-transparent effects such as smoke, we can use this loss to focus on the smoke regardless of the image content. Here, we used "smoking cigar" as the composition text and "smoke" as the green-screen text.
| Input image | w/o green-screen | w/o green-screen (edit layer) | Full objective | Full objective (edit layer) |
|---|---|---|---|---|
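As a concrete illustration, the following is a minimal sketch of such a green-screen supervision, assuming the generator outputs an RGBA edit layer and using OpenAI's CLIP: the edit layer is composited over a solid green background and scored against a screen text such as "smoke over a green screen". The variable names, the exact shade of green, and the preprocessing (no augmentations) are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of a green-screen (chroma-key) loss.
# Assumptions: the generator outputs an RGB edit layer `edit_rgb` and an
# opacity map `edit_alpha`, both in [0, 1], with shapes (1, 3, H, W) and (1, 1, H, W).
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # fp32 weights so gradients flow cleanly through the loss

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_normalize(img):
    # Resize to CLIP's input resolution and apply its normalization.
    img = F.interpolate(img, size=224, mode="bilinear", align_corners=False)
    return (img - CLIP_MEAN) / CLIP_STD

def green_screen_loss(edit_rgb, edit_alpha, screen_text="smoke over a green screen"):
    """Composite the edit layer over a green background and score it with CLIP."""
    green = torch.tensor([0.0, 1.0, 0.0], device=edit_rgb.device).view(1, 3, 1, 1)
    composite = edit_alpha * edit_rgb + (1.0 - edit_alpha) * green  # chroma-key composite

    image_emb = clip_model.encode_image(clip_normalize(composite))
    with torch.no_grad():  # the text embedding is a fixed target
        text_emb = clip_model.encode_text(clip.tokenize([screen_text]).to(edit_rgb.device))

    # Cosine distance between the composite and the screen text.
    return 1.0 - F.cosine_similarity(image_emb, text_emb).mean()
```

Because the loss only sees the edit layer composited over green, it constrains the generated effect itself, independently of the input image content.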
To the best of our knowledge, there is no method tailored for solving our task: text-driven, semantic, localized editing of existing objects in real-world images and videos. We next provide additional comparisons to several prominent text-driven image editing methods included in the human perceptual evaluation, which can be applied to a setting similar to ours: editing real-world images that are not restricted to specific domains. See the discussion and human perceptual evaluation results in Sec. 4.3.
We illustrate the differences between our method and text-guided StyleGAN-based image manipulation methods (StyleCLIP and StyleGAN-NADA).
First row: the current state-of-the-art StyleGAN encoder + StyleCLIP. The image can be reliably inverted once it is aligned, yet StyleCLIP fails to add the red hat.
Second row: e4e + StyleGAN-NADA (we used an encoder that remains applicable when the StyleGAN weights are modified). The given image cannot be reliably inverted, and the red hat cannot be properly generated. Note that in StyleGAN-NADA the entire domain needs to satisfy the edit (e.g., all face images should satisfy "person wearing red hat").
Third row: editing a StyleGAN-generated image (inversion is not required). Note that although the edit refers only to the wheels, StyleGAN-NADA modifies the entire car (shape and colors) as well as the background.
Our method automatically performs localized edits on arbitrary images from various domains.
| "person" to "person wearing red hat" | StyleGAN Inversion (HyperStyle [1]) | StyleCLIP | Ours |
|---|---|---|---|

| "person" to "person wearing red hat" | StyleGAN Inversion (e4e [2]) | StyleGAN-NADA | Ours |
|---|---|---|---|

| "Chrome wheels" to "TRON wheels" | StyleGAN-NADA | Ours |
|---|---|---|
We quantify the effectiveness of our key design choices for video editing by comparing our video method against: (i) Atlas baseline: feeding the discretized 2D atlas to our single-image method (Sec. 3.1) and using the same inference pipeline illustrated in Fig. 4 to map the edited atlas back to the frames; (ii) Frames baseline: treating all video frames as a single internal dataset used to train our generator, and at inference applying the trained generator independently to each frame. We note that the Atlas baseline does not utilize the richness of the video and produces lower-quality texture results, while the Frames baseline produces temporally inconsistent edits.
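For reference, the following is a minimal sketch of the mapping used at inference to bring an edit layer generated over the atlas back to the frames (cf. Fig. 4). It assumes per-frame UV coordinates into the atlas are available from the pretrained layered-atlas model, already scaled to grid_sample's [-1, 1] convention; the function and tensor names are illustrative.

```python
# Minimal sketch: map an atlas-space edit layer back to video frames.
# Assumptions: `uv_maps` holds per-frame UV coordinates into the atlas,
# already in grid_sample's [-1, 1] convention; names are illustrative.
import torch
import torch.nn.functional as F

def atlas_edit_to_frames(atlas_edit_rgba, uv_maps, frames):
    """
    atlas_edit_rgba: (1, 4, Ha, Wa) edit layer (RGB + opacity) generated over the atlas.
    uv_maps:         (T, H, W, 2) per-frame UV coordinates into the atlas.
    frames:          (T, 3, H, W) original video frames.
    Returns the edited frames, shape (T, 3, H, W).
    """
    T = frames.shape[0]
    # Bilinearly sample the atlas-space edit layer at each frame's UV coordinates.
    sampled = F.grid_sample(
        atlas_edit_rgba.expand(T, -1, -1, -1), uv_maps,
        mode="bilinear", align_corners=True,
    )  # (T, 4, H, W)
    edit_rgb, edit_alpha = sampled[:, :3], sampled[:, 3:4]
    # Alpha-blend the per-frame edit layers over the original frames.
    return edit_alpha * edit_rgb + (1.0 - edit_alpha) * frames
```

The Frames baseline skips this mapping and instead runs the trained generator directly on each frame, which is why it lacks the temporal consistency provided by the shared atlas.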
[1] "HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing", Alaluf et al., arXiv 2021
[2] "Designing an encoder for stylegan image manipulation", Tov et al., ACM Transactions on Graphics (TOG), 2021