We recommend viewing all images and videos in full screen. Click on an image or video to see it at full scale.
We provide another ablation for our Green-Screen Loss. As discussed in Sec. 3.1, a nice property of this loss is that it allows intuitive supervision of a desired effect. For example, when generating semi-transparent effects such as smoke, we can use this loss to focus on the smoke regardless of the image content. Here, we used "smoking cigar" as the composition text and "smoke" as the green-screen text.
| Input image | w/o green-screen | w/o green-screen (edit layer) | Full objective | Full objective (edit layer) |
|---|---|---|---|---|
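As a concrete illustration, the following is a minimal sketch of such a green-screen supervision, assuming the generator outputs an RGBA edit layer and using OpenAI's CLIP: the edit layer is composited over a solid green background and scored against a screen text such as "smoke over a green screen". The variable names, the exact shade of green, and the preprocessing (no augmentations) are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of a green-screen (chroma-key) loss.
# Assumptions: the generator outputs an RGB edit layer `edit_rgb` and an
# opacity map `edit_alpha`, both in [0, 1], with shapes (1, 3, H, W) and (1, 1, H, W).
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # fp32 weights so gradients flow cleanly through the loss

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_normalize(img):
    # Resize to CLIP's input resolution and apply its normalization.
    img = F.interpolate(img, size=224, mode="bilinear", align_corners=False)
    return (img - CLIP_MEAN) / CLIP_STD

def green_screen_loss(edit_rgb, edit_alpha, screen_text="smoke over a green screen"):
    """Composite the edit layer over a green background and score it with CLIP."""
    green = torch.tensor([0.0, 1.0, 0.0], device=edit_rgb.device).view(1, 3, 1, 1)
    composite = edit_alpha * edit_rgb + (1.0 - edit_alpha) * green  # chroma-key composite

    image_emb = clip_model.encode_image(clip_normalize(composite))
    with torch.no_grad():  # the text embedding is a fixed target
        text_emb = clip_model.encode_text(clip.tokenize([screen_text]).to(edit_rgb.device))

    # Cosine distance between the composite and the screen text.
    return 1.0 - F.cosine_similarity(image_emb, text_emb).mean()
```

Because the loss only sees the edit layer composited over green, it constrains the generated effect itself, independently of the input image content.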
To the best of our knowledge, there is no method tailored for solving our task: text-driven, semantic, localized editing of existing objects in real-world images and videos. We next provide additional comparisons to several prominent text-driven image editing methods included in the human perceptual evaluation, which can be applied to a setting similar to ours: editing real-world images that are not restricted to specific domains. See the discussion and human perceptual evaluation results in Sec. 4.3.
We illustrate the differences between our method and text-guided StyleGAN-based image manipulation methods (StyleCLIP and StyleGAN-NADA).
First row: the current state-of-the-art StyleGAN encoder + StyleCLIP. The image can be reliably inverted once it is aligned, yet StyleCLIP fails to add the red hat.
Second row: e4e + StyleGAN-NADA (we used an encoder that remains applicable when the StyleGAN weights are modified). The given image cannot be reliably inverted, and the red hat cannot be properly generated. Note that in StyleGAN-NADA the entire domain needs to satisfy the edit (e.g., all face images should satisfy "person wearing red hat").
Third row: editing a StyleGAN-generated image (inversion is not required). Note that although the edit refers only to the wheels, StyleGAN-NADA modifies the entire car (shape and colors) as well as the background.
Our method automatically performs localized edits on arbitrary images from various domains.
| "person" to "person wearing red hat" | StyleGAN Inversion (HyperStyle [1]) | StyleCLIP | Ours |
|---|---|---|---|

| "person" to "person wearing red hat" | StyleGAN Inversion (e4e [2]) | StyleGAN-NADA | Ours |
|---|---|---|---|

| "Chrome wheels" to "TRON wheels" | StyleGAN-NADA | Ours |
|---|---|---|
We quantify the effectiveness of our key design choices for video editing by comparing our video method against: (i) Atlas baseline: feeding the discretized 2D atlas to our single-image method (Sec. 3.1) and using the same inference pipeline illustrated in Fig. 4 to map the edited atlas back to the frames; (ii) Frames baseline: treating all video frames as a single internal dataset used to train our generator, and at inference applying the trained generator independently to each frame. We note that the Atlas baseline does not utilize the richness of the video and produces lower-quality texture results, while the Frames baseline produces temporally inconsistent edits.
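For reference, the following is a minimal sketch of the mapping used at inference to bring an edit layer generated over the atlas back to the frames (cf. Fig. 4). It assumes per-frame UV coordinates into the atlas are available from the pretrained layered-atlas model, already scaled to grid_sample's [-1, 1] convention; the function and tensor names are illustrative.

```python
# Minimal sketch: map an atlas-space edit layer back to video frames.
# Assumptions: `uv_maps` holds per-frame UV coordinates into the atlas,
# already in grid_sample's [-1, 1] convention; names are illustrative.
import torch
import torch.nn.functional as F

def atlas_edit_to_frames(atlas_edit_rgba, uv_maps, frames):
    """
    atlas_edit_rgba: (1, 4, Ha, Wa) edit layer (RGB + opacity) generated over the atlas.
    uv_maps:         (T, H, W, 2) per-frame UV coordinates into the atlas.
    frames:          (T, 3, H, W) original video frames.
    Returns the edited frames, shape (T, 3, H, W).
    """
    T = frames.shape[0]
    # Bilinearly sample the atlas-space edit layer at each frame's UV coordinates.
    sampled = F.grid_sample(
        atlas_edit_rgba.expand(T, -1, -1, -1), uv_maps,
        mode="bilinear", align_corners=True,
    )  # (T, 4, H, W)
    edit_rgb, edit_alpha = sampled[:, :3], sampled[:, 3:4]
    # Alpha-blend the per-frame edit layers over the original frames.
    return edit_alpha * edit_rgb + (1.0 - edit_alpha) * frames
```

The Frames baseline skips this mapping and instead runs the trained generator directly on each frame, which is why it lacks the temporal consistency provided by the shared atlas.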
[1] "HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing", Alaluf et al., arXiv 2021
[2] "Designing an encoder for stylegan image manipulation", Tov et al., ACM Transactions on Graphics (TOG), 2021