Baptiste Collé
Roberto Gheda
Karol Dobiczek
As text-to-image generative models have gained mainstream attention, many users have started to apply them to tasks that require specific results. Many have also noticed the careful considerations that go into constructing the prompt passed to the model, often requiring lengthy prompts with many specific instructions. Sometimes even this is not enough: due to the non-deterministic nature of these models, the desired image may simply be unlikely to ever be generated. In cases like these, a more precise control scheme would greatly improve the user experience and reduce the time needed to generate a satisfactory result.
One of the more prominent techniques that addresses this issue, giving the user more control than just the textual prompt, is ControlNet. In short, ControlNet inserts trainable convolution layers between the layers of a pre-trained diffusion model (paper). This allows us to train or fine-tune a model conditioned on more than just the text prompt. The original ControlNet paper already describes models trained on numerous types of condition images: sketches, image contours, pose skeletons, segmentation maps, normal maps and depth maps. Most of these condition image types can be hard to manipulate: normal and depth maps are computer generated, and image contours might require strong artistic skills. Segmentation maps, however, seem like a good choice for solving this problem. They assign a mask to each identified object in the picture, which means they capture much of the relevant information about the image contents while being easy to manipulate: a user would only need to change the masks' shapes.
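To give a feel for what segmentation conditioning looks like in practice, here is a minimal sketch using the Hugging Face diffusers library, where a segmentation map is passed alongside the text prompt. The checkpoint names, file names and prompt are assumptions chosen for illustration, not necessarily the exact setup used in this project.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Checkpoint names are assumptions: any segmentation-conditioned ControlNet
# paired with a compatible Stable Diffusion base model would work.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The segmentation map (colour-coded object masks) is passed next to the prompt
# and constrains where each object appears in the generated image.
seg_map = Image.open("seg_map.png")
image = pipe("a cozy living room", image=seg_map, num_inference_steps=30).images[0]
image.save("living_room.png")
```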
As with many machine learning applications, a central problem is the data. Here the problem is twofold: 1) segmentation map datasets used for current conditioning models, such as ADE20K (link) and COCO-Stuff (link), are usually manually annotated, and this annotation process can be costly; 2) users of the application need a segmentation map of their image, and this map should be segmented in a similar way to the training data to give satisfactory results. A novel image segmentation tool that can solve both of these issues is the Segment Anything Model (SAM), a foundation model able to generate segmentation maps using textual, coordinate or mask prompts.
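To make this concrete, the sketch below shows how a segmentation map could be produced automatically with SAM's automatic mask generator from the segment_anything package. The checkpoint path, input file and random colouring scheme are assumptions for illustration.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a released SAM checkpoint (path and model variant are assumptions).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open("photo.jpg").convert("RGB"))
masks = mask_generator.generate(image)  # list of dicts with "segmentation", "area", ...

# Paint each mask with a random colour to obtain a segmentation map usable as a
# conditioning image, drawing larger masks first so smaller objects stay visible.
seg_map = np.zeros_like(image)
for m in sorted(masks, key=lambda m: m["area"], reverse=True):
    seg_map[m["segmentation"]] = np.random.randint(0, 255, size=3)
Image.fromarray(seg_map).save("seg_map.png")
```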
In this project we use SAM to generate a conditioning dataset, with which we fine-tune a pre-trained segmentation-conditioned model. The goal is a model that can generate pictures conditioned on segmentation maps automatically produced by SAM. With our research we aim to answer the questions “Is it possible to use ControlNet to successfully condition Stable Diffusion?” and “What is the impact of conditioning Stable Diffusion using Segment Anything?”.
Stable Diffusion (link) is a text-to-image model: given a textual prompt, it outputs an image corresponding to that prompt. However, this technique is quite difficult to control, which has even led to the creation of the field of prompt engineering. Furthermore, natural language cannot encode all the information about an image; there is no one-to-one mapping between an image and a text description. For example, with Stable Diffusion it is extremely complicated to encode positional information, so we cannot easily control the visual composition of an image. Thus, we need an external network to guide the diffusion process. We decided to focus on segmentation maps as an additional constraint for the image generation.
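For reference, this is roughly what the text-only interface looks like with the diffusers library; the prompt deliberately contains positional information that the model is free to ignore. The model id and prompt are assumptions for the example.

```python
import torch
from diffusers import StableDiffusionPipeline

# Model id is an assumption; any Stable Diffusion checkpoint would do.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The text prompt is the only handle we have on the output here;
# positional requests like "on the left" are often not respected.
image = pipe("a red vase on the left side of a wooden table").images[0]
image.save("vase.png")
```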
Figure 1. The diffusion process of a Stable Diffusion model.
Figure 1 shows the diffusion process in action. The main architecture of Stable Diffusion is built around a large autoencoder-style network, a U-Net (paper). The network predicts the noise present in an image at different time steps of the diffusion process. With this ability we can reverse the process and go from pure noise to an image by iteratively removing the predicted noise.
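A rough sketch of this reverse process, written against the diffusers U-Net and scheduler interfaces, looks as follows; the number of denoising steps and the latent shape are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample(unet, scheduler, text_embeddings, shape=(1, 4, 64, 64), device="cuda"):
    """Minimal sketch of the reverse diffusion loop: start from pure noise and
    iteratively subtract the noise predicted by the U-Net at each time step."""
    latents = torch.randn(shape, device=device)   # start from Gaussian noise
    scheduler.set_timesteps(50)                   # number of denoising steps (assumption)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents  # decoded into pixel space by the VAE afterwards
```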
The architecture we used to control Stable Diffusion is a technique called ControlNet (paper). This technique relies on LoRA (Low-Rank Adaptation, paper), which creates small low-rank matrices that are added to the weights of Stable Diffusion’s U-Net, as illustrated in Figure 2. These matrices are learned during training by providing a prompt, a conditioning image and the expected output image.
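The core idea of LoRA can be written down in a few lines: freeze the original weight and learn a low-rank update on top of it. The sketch below is a generic illustration of that idea (the rank and scaling factor are assumptions), not the exact implementation used in our training setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the frozen weight is left untouched and a
    low-rank update B @ A of rank r is learned and added to its output."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original Stable Diffusion weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Zero-initialising the second matrix means the adapted model starts out identical to the frozen base model, so training begins from the unmodified Stable Diffusion behaviour and only gradually learns to respect the conditioning.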