Text2LIVE: Text-Driven Layered Image and Video Editing

Supplementary Material


We recommend watching all images and videos in full screen. Click on the images or videos for seeing them in full scale.


Image Results

Sample results of our image framework (part of which are shown in Fig. 5). Note how objects are "painted" in a semantic-aware manner, guided only by text, without any user provided masks.

Input image "oreo cake" "ice" "brioche" "spinach moss cake"
Input image "red velvet" "ice" "brioche" "melted cheese"
Input image "snow" "volcano" "sahara" "ocean"
Input image "golden ball" "wooden ball" "stained glass ball" "crochet ball"
Input image "golden birds" "wooden birds" "stained glass birds" "crochet birds"
Input image "golden dolphin" "wooden dolphin" "orca" "crochet dolphin"


Image Results: Semi-Transparent Effects

Our method can generate complex semi-transparent effects. Notice the accurate localization of the fire, achieved with our bootstrapping (the localization text is "mouth").

Input image green screen: "fire" composition: "fire out of bear's mouth"
Input image green screen: "smoke" composition: "smoking cigar"
Input image green screen: "fire" composition: "car on fire"
Input image green screen: "latte art heart pattern" composition: "latte art heart pattern"


Video Results

We show the original video and the edited video for different kinds of target text prompts.





"Lucia" - Background Edits




"Car-turn" - Background Edits


"Lucia" - Foreground Edits


"Car-turn" - Foreground Edits






Two Layers Edit

We show here an example of editing both layers of the video (foreground and background).


Original atlas Edited atlas
Original atlas Edited atlas



CLIP often exhibit strong association between text and certain visual elements such as the shape of objects (e.g., “moon” with crescent shape in the top row), or additional new objects (e.g., “birthday cake” with candles in the bottom row). As our method is designed to edit existing objects, generating new ones may not lead to a visually pleasing result. However, often the desired edit can be achieved by using more specific text

Input image "moon" "a bright full moon"
Input image "chess cake" "birthday cake"