MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path

Abstract

Image generation with diffusion models can be controlled in multiple ways. In this paper, we systematically analyze the equations of modern generative diffusion networks to propose a framework, called MDP, that characterizes the design space of suitable manipulations. We identify five different manipulations: the intermediate latent, the conditional embedding, the cross-attention maps, the guidance, and the predicted noise. We analyze the corresponding parameters of these manipulations as well as the manipulation schedule, and we show that several previous editing methods fit nicely into our framework. In particular, we identify one specific configuration, manipulation of the predicted noise, as a new type of control that can perform higher-quality edits than previous work for a variety of local and global edits.

Manipulations

We test five manipulations in the design space, namely:
MDP-$x_t$: intermediate latent interpolation.
MDP-$c$: conditional embedding interpolation.
P2P (Prompt-to-Prompt): cross-attention map manipulation.
MDP-$\beta$: guidance.
MDP-$\epsilon_t$: predicted noise interpolation.
Comparisons between the different manipulations can be found in our paper. On this website, we provide comparisons between P2P and our highlighted manipulation, MDP-$\epsilon_t$. The sketch below illustrates where each manipulation hooks into a generic sampling loop.
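The following is a minimal sketch (not the released implementation) of a DDIM-style denoising loop, annotated with where each of the five manipulations intervenes. The names `eps_model`, `ddim_step`, the blend weight `w`, the guidance scale `beta`, and the schedule window `[t_min, t_max]` are hypothetical placeholders chosen for illustration.

```python
import torch

def eps_model(x_t, t, c):
    # Stand-in for the conditional noise predictor epsilon_theta(x_t, t, c).
    return torch.randn_like(x_t)

def ddim_step(x_t, eps, t):
    # Stand-in for one deterministic DDIM update x_t -> x_{t-1}.
    return x_t - eps / 50.0

def sample_with_hooks(x_T, c_src, c_tgt, T=50, t_max=40, t_min=10,
                      beta=7.5, w=0.5):
    x_t = x_T
    for t in reversed(range(T)):
        in_window = t_min <= t <= t_max
        # MDP-c: interpolate source and target conditional embeddings.
        c = w * c_src + (1.0 - w) * c_tgt if in_window else c_tgt
        # MDP-beta: classifier-free guidance with an editable scale.
        eps_uncond = eps_model(x_t, t, torch.zeros_like(c))
        eps_cond = eps_model(x_t, t, c)
        eps = eps_uncond + beta * (eps_cond - eps_uncond)
        # MDP-eps_t would blend `eps` with noise recorded from a source
        # branch here; P2P instead edits the cross-attention maps inside
        # eps_model itself.
        x_t = ddim_step(x_t, eps, t)
        # MDP-x_t would interpolate x_t with a source-branch latent here.
    return x_t

# Toy usage with random tensors standing in for latents and embeddings.
x = sample_with_hooks(torch.randn(1, 4, 64, 64),
                      torch.randn(1, 77, 768),
                      torch.randn(1, 77, 768))
```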

MDP-$\epsilon_t$

We use an example to demonstrate the idea of MDP-$\epsilon_t$: predicted noise interpolation. The top branch is inverted from a real image given condition $\mathbf{c}^{(A)}$ "Photo of a rabbit on the grass". The bottom branch is generated using condition $\mathbf{c}^{(B)}$ "Photo of a rabbit in a library". We copy the predicted noise from step $t_{max}$ to $t_{min}$ of the top branch, then use $\mathbf{c}^{(B)}$ to denoise and generate the images in the middle branch.
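Below is a minimal sketch of this procedure under the same assumptions as the earlier snippet: `eps_model` and `ddim_step` are hypothetical stand-ins, and in practice the starting latent `x_T` would come from DDIM inversion of the real image rather than random noise.

```python
import torch

def eps_model(x_t, t, c):
    # Stand-in for the conditional noise predictor.
    return torch.randn_like(x_t)

def ddim_step(x_t, eps, t):
    # Stand-in for one deterministic DDIM update x_t -> x_{t-1}.
    return x_t - eps / 50.0

def mdp_eps_edit(x_T, c_A, c_B, T=50, t_max=40, t_min=10):
    # Source branch (top): denoise under c_A, caching the predicted noise.
    cached_eps = {}
    x_t = x_T.clone()
    for t in reversed(range(T)):
        eps = eps_model(x_t, t, c_A)
        cached_eps[t] = eps
        x_t = ddim_step(x_t, eps, t)
    # Edited branch (middle): copy the cached noise inside [t_min, t_max],
    # otherwise predict the noise under the target condition c_B.
    x_t = x_T.clone()
    for t in reversed(range(T)):
        if t_min <= t <= t_max:
            eps = cached_eps[t]           # copied from the source branch
        else:
            eps = eps_model(x_t, t, c_B)  # target-conditioned prediction
        x_t = ddim_step(x_t, eps, t)
    return x_t
```

Copying the source noise for the early, high-noise steps preserves the overall layout of the original image, while the later target-conditioned steps inject the edit described by $\mathbf{c}^{(B)}$.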

Qualitative results

Local editing

Results of changing object(s) comparing Prompt-to-Prompt and our method.

Results of adding object(s) comparing Prompt-to-Prompt and our method.

Results of changing attribute comparing Prompt-to-Prompt and our method.

Results of removing object(s) comparing Prompt-to-Prompt and our method.

Results of mixing objects comparing Prompt-to-Prompt and our method.

Global editing

Results of changing background comparing Prompt-to-Prompt and our method.

Results of in-domain transfer comparing Prompt-to-Prompt and our method.

Results of out-of-domain transfer comparing Prompt-to-Prompt and our method.

Results of stylization comparing Prompt-to-Prompt and our method.

Acknowledgments

All input images tested in the paper are real-world images from Unsplash, Flickr, or the COCO dataset. We base this website on the EG3D website template.