eDiff-I: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers

TL;DR: eDiff-I is a next-generation generative AI content creation tool offering unprecedented text-to-image synthesis, instant style transfer, and intuitive painting-with-words capabilities.

We propose eDiff-I, a diffusion model for synthesizing images from text. Motivated by the empirical observation that the behavior of diffusion models differs at different stages of sampling, we propose to train an ensemble of expert denoising networks, each specializing in a specific noise interval. Our model is conditioned on T5 text embeddings, CLIP image embeddings, and CLIP text embeddings. Our approach can generate photorealistic images corresponding to any input text prompt. Beyond text-to-image synthesis, we present two additional capabilities: (1) style transfer, which lets us control the style of the generated sample using a reference style image, and (2) "Paint with words", an application in which the user generates images by painting segmentation maps on a canvas, which is handy for crafting the image they have in mind.
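To make the ensemble idea concrete, here is a minimal Python sketch of how a sampler might route each denoising step to the expert responsible for the current noise level. It is an illustration under stated assumptions, not the eDiff-I implementation: the class name `ExpertEnsemble`, the helper `make_toy_expert`, the number of experts, and the interval boundaries are all hypothetical, and the toy "denoisers" simply rescale their input in place of real neural networks.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence

# A denoiser maps (noisy sample, noise level) -> denoised estimate.
Denoiser = Callable[[List[float], float], List[float]]


class ExpertEnsemble:
    """Routes each call to the expert whose noise interval contains sigma.

    Hypothetical sketch: `boundaries` are assumed split points on the noise
    level, so k boundaries define k + 1 intervals and need k + 1 experts.
    """

    def __init__(self, boundaries: Sequence[float], experts: Sequence[Denoiser]):
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted in increasing order")
        if len(experts) != len(boundaries) + 1:
            raise ValueError("need exactly len(boundaries) + 1 experts")
        self.boundaries = list(boundaries)
        self.experts = list(experts)

    def expert_index(self, sigma: float) -> int:
        # bisect_right counts the boundaries <= sigma, which is the index of
        # the interval sigma falls into (a boundary value belongs to the
        # interval above it).
        return bisect_right(self.boundaries, sigma)

    def denoise(self, x: List[float], sigma: float) -> List[float]:
        return self.experts[self.expert_index(sigma)](x, sigma)


def make_toy_expert(scale: float) -> Denoiser:
    """Stand-in for a trained network: scales the input by a constant."""
    def expert(x: List[float], sigma: float) -> List[float]:
        return [scale * v for v in x]
    return expert


# Three toy experts for low, medium, and high noise (assumed split points).
ensemble = ExpertEnsemble(
    boundaries=[1.0, 10.0],
    experts=[make_toy_expert(1.0), make_toy_expert(0.5), make_toy_expert(0.0)],
)
```

During sampling the noise level shrinks from step to step, so a loop that calls `ensemble.denoise(x, sigma)` at every step would pass through the high-noise expert first and the low-noise expert last, while each individual step still evaluates only one network.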

A highly detailed digital painting of a portal in a mystic forest with many beautiful trees. A person is standing in front of the portal

A highly detailed zoomed-in digital painting of a cat dressed as a witch wearing a wizard hat in a haunted house, artstation

An image of a beautiful landscape of an ocean. There is a huge rock in the middle of the ocean. There is a mountain in the background. Sun is setting.

Style Reference

A photo of a duckling wearing a medieval soldier helmet and riding a skateboard.

A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed.