Text2LIVE: Text-Driven Layered Image and Video Editing

Supplementary Material

Green-Screen Ablation and Baseline Comparisons


 


We recommend viewing all images and videos in full screen. Click on an image or video to see it at full scale.

 


Green-Screen Ablation

We provide an additional ablation of our Green-Screen Loss. As discussed in Sec. 3.1, a useful property of this loss is that it provides intuitive supervision over a desired effect: when generating semi-transparent effects, we can use it to focus on the effect itself, regardless of the image content. Here, we used "smoking cigar" as the composition text and "smoke" as the green-screen text.

Input image | w/o green-screen | w/o green-screen (edit layer) | Full objective | Full objective (edit layer)
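
For concreteness, here is a minimal sketch of how such a green-screen objective can be implemented, assuming the OpenAI CLIP package and a generator that outputs an RGBA edit layer. The names, the text template, and the shapes are illustrative; the exact templates, augmentations, and CLIP preprocessing from the paper are omitted.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# A solid green background, as in chroma keying.
GREEN = torch.tensor([0.0, 1.0, 0.0], device=device).view(1, 3, 1, 1)

def green_screen_loss(color, alpha, text="smoke over a green screen"):
    # color: (B, 3, H, W) edit-layer RGB; alpha: (B, 1, H, W) opacity; both in [0, 1].
    # Composite the edit layer over green so CLIP only "sees" the generated effect.
    screen = alpha * color + (1 - alpha) * GREEN
    screen = F.interpolate(screen, size=(224, 224), mode="bilinear",
                           align_corners=False)
    image_emb = model.encode_image(screen)
    text_emb = model.encode_text(clip.tokenize([text]).to(device))
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # Cosine distance between the composited layer and the target text.
    return 1.0 - (image_emb * text_emb).sum(dim=-1).mean()
```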

 


Comparison to Image Baselines

To the best of our knowledge, there is no existing method tailored to our task: text-driven, semantic, localized editing of existing objects in real-world images and videos. We next provide additional comparisons to several prominent text-driven image editing methods included in our human perceptual evaluation, all of which can be applied in a setting similar to ours: editing real-world images that are not restricted to specific domains. See the discussion and human perceptual evaluation results in Sec. 4.3.

"smoking cigar" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"golden butterfly" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"orca" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"fire out of bear's mouth" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"latte heart pattern" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"ice" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"golden birds" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"golden horse" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"oreo cake" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours
"spinach moss cake" CLIPStyler Diffusion+CLIP VQ-GAN+CLIP Ours

 


Prior Work: StyleGAN Image Manipulation

We illustrate the differences between our method and text-guided StyleGAN-based image manipulation methods (StyleCLIP and StyleGAN-NADA).

First row: a current state-of-the-art StyleGAN encoder (HyperStyle [1]) + StyleCLIP. The image can be reliably inverted once it is aligned, yet StyleCLIP fails to edit the hat.

Second row: e4e [2] + StyleGAN-NADA (we used the e4e encoder, which remains applicable when the StyleGAN weights are modified). The given image cannot be reliably inverted, and the hat cannot be properly edited. Note that in StyleGAN-NADA the entire domain needs to satisfy the edit (e.g., all face images should satisfy "person wearing red hat").

Third row: editing a StyleGAN-generated image (inversion is not required). Note that although the edit refers only to the wheels, StyleGAN-NADA modifies the entire car (shape and colors) as well as the background.

In contrast, our method automatically performs localized edits on arbitrary images from various domains.

"person" to "person wearing red hat" StyleGAN Inversion (HyperStyle [1]) StyleCLIP Ours
"person" to "person wearing red hat" StyleGAN Inversion (e4e [2]) StyleGAN-NADA Ours
"Chrome wheels" to "TRON wheels" StyleGAN-NADA Ours

 


Comparison to Video Baselines

We quantify the effectiveness of our key design choices for video editing by comparing our video method against two baselines: (i) Atlas baseline: feeding the discretized 2D atlas to our single-image method (Sec. 3.1) and using the same inference pipeline illustrated in Fig. 4 to map the edited atlas back to the frames. (ii) Frames baseline: treating all video frames as a single internal dataset used to train our generator; at inference, the trained generator is applied to each frame independently. The Atlas baseline does not exploit the richness of the video and produces lower-quality textures, while the Frames baseline produces temporally inconsistent edits.
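
To make the two baselines concrete, here is a minimal sketch in PyTorch. It assumes a trained `generator` that maps an RGB image to an RGBA edit layer, and precomputed per-frame UV mappings into the atlas; the function names, shapes, and bilinear sampling are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def frames_baseline(generator, frames):
    # Frames baseline: apply the trained generator to every frame
    # independently; with no shared atlas, edits may flicker over time.
    edited = []
    for frame in frames:                         # frame: (3, H, W) in [0, 1]
        rgba = generator(frame.unsqueeze(0))[0]  # (4, H, W) edit layer
        color, alpha = rgba[:3], rgba[3:4]
        edited.append(alpha * color + (1 - alpha) * frame)  # composite
    return torch.stack(edited)

def atlas_baseline(generator, atlas, uv_maps, frames):
    # Atlas baseline: edit the discretized 2D atlas once with the
    # single-image method, then map the edited layer back to each
    # frame through its UV mapping (cf. Fig. 4) and composite.
    rgba = generator(atlas.unsqueeze(0))         # (1, 4, Ha, Wa)
    edited = []
    for frame, uv in zip(frames, uv_maps):       # uv: (H, W, 2) in [-1, 1]
        layer = F.grid_sample(rgba, uv.unsqueeze(0),
                              mode="bilinear", align_corners=True)[0]
        color, alpha = layer[:3], layer[3:4]
        edited.append(alpha * color + (1 - alpha) * frame)
    return torch.stack(edited)
```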

"swarovski blue crystal swan"

 

"rusty jeep"

 

"blue dress"

 

"ocean at sunset"

 

"giraffe with a hairy colorful mane"

 


[1] "HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing", Alaluf et al., arXiv 2021

[2] "Designing an encoder for stylegan image manipulation", Tov et al., ACM Transactions on Graphics (TOG), 2021