In this blogpost, we’ll examine applications of VQGAN combined with Discrete Absorbing Diffusion models, as introduced in the amazing paper:
Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes.
If you are curious about the clever techniques that made this possible, the paper itself and our technical blogpost will definitely be something for you! In the technical blogpost we gave an overview of the previous SOTA generative models and then arrived at a new class of models, the diffusion models. We went in depth on the architecture of the Discrete Absorbing Diffusion models. In this blogpost, we will take a step back and look at how we can leverage the characteristics of this model for creative purposes.
In the technical blogpost we discussed two models: a VQGAN, which encodes an image into a grid of discrete latent codes and decodes such a grid back into an image, and a discrete absorbing diffusion model, which generates these latent codes.
In the technical blogpost we saw that a discrete diffusion model doesn’t generate these latent codes from left to right, as an autoregressive model does, but can generate them in a random order.
This out-of-order and bidirectional sampling allows us to use novel techniques to edit and generate images:
As in most generative models, we can generate images from scratch. The diffusion model starts with an empty 16x16 grid, in which every position holds a MASK token, and iteratively generates tokens to fill up the grid. Once every code is generated, the VQGAN decoder uses these generated codes to create globally coherent images competitive with other generative models.
Now let’s talk about conditional image generation. This is a fancy way to say that we choose a part of the image to steer the generation of the rest of the image.
With the existing autoregressive models, it is possible to use conditional image generation based on a partial image. The idea here is to give the model the top part of an image as context and then complete the content by predicting the next pixels one by one in a top-to-bottom, left-to-right fashion. Instead of predicting pixels, we could also predict discrete codes in a unidirectional manner, which is an idea that was explored in the paper: Taming Transformers for High-Resolution Image Synthesis.
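To make the sampling loop a bit more concrete, here is a minimal sketch of filling an all-MASK grid in a random order. It is not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for the transformer that just proposes random codes, and the `MASK` id, the codebook size of 1024 and the fixed number of steps are assumptions made for illustration.

```python
import random

MASK = -1      # hypothetical id for the absorbing MASK token
GRID = 16      # 16x16 latent grid, as in the post
VOCAB = 1024   # assumed codebook size

def toy_denoiser(tokens, rng):
    """Stand-in for the transformer: propose a code for every position."""
    return [rng.randrange(VOCAB) for _ in tokens]

def sample_unconditional(steps=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * (GRID * GRID)            # start from an all-MASK grid
    order = list(range(len(tokens)))
    rng.shuffle(order)                         # positions are unmasked in random order
    per_step = -(-len(order) // steps)         # ceiling division
    for start in range(0, len(order), per_step):
        proposal = toy_denoiser(tokens, rng)   # predict all positions in parallel
        for pos in order[start:start + per_step]:
            tokens[pos] = proposal[pos]        # commit only this step's positions
    # Reshape the flat token list into GRID rows for the decoder.
    return [tokens[r * GRID:(r + 1) * GRID] for r in range(GRID)]

grid = sample_unconditional()
```

In a real setup the proposal would depend on the tokens that are already filled in, and the finished grid would be handed to the VQGAN decoder.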
Impressive, right? However, it is only possible to condition on the top part of an image, due to the unidirectional nature of autoregressive models.
That’s where our diffusion model can do better! We can encode an image with the VQGAN encoder, giving us a grid of latent variables. Then we choose which parts of the image we want to keep and let the model generate everything around it.
This is a big improvement in flexibility compared to autoregressive models since we are no longer restricted to specific parts of an image. As a demonstration, in the following figure we want to keep the tower in the middle with different generated backgrounds. We keep the codes in the middle region 🏯, replace the other codes with the MASK value and ask the model to generate multiple new images.
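The masking step described above can be sketched in a few lines. This is a toy version under stated assumptions: `encoded` pretends to be the VQGAN encoder output, `fill_masked` is a hypothetical stand-in for the diffusion model that resamples MASK positions at random, and the `MASK` id and codebook size are made up.

```python
import random

MASK = -1      # hypothetical id for the MASK token
GRID = 16
VOCAB = 1024   # assumed codebook size

def keep_region(codes, top, left, height, width):
    """Copy `codes`, replacing everything outside the box with MASK."""
    return [
        [codes[r][c] if top <= r < top + height and left <= c < left + width else MASK
         for c in range(GRID)]
        for r in range(GRID)
    ]

def fill_masked(grid, rng):
    """Toy stand-in for the diffusion model: only MASK positions are resampled."""
    return [[rng.randrange(VOCAB) if code == MASK else code for code in row]
            for row in grid]

rng = random.Random(0)
# Pretend this grid came out of the VQGAN encoder.
encoded = [[rng.randrange(VOCAB) for _ in range(GRID)] for _ in range(GRID)]

context = keep_region(encoded, top=4, left=4, height=8, width=8)   # keep the middle
samples = [fill_masked(context, random.Random(seed)) for seed in range(3)]
```

Every entry in `samples` shares the same middle codes and differs only in the regenerated surroundings.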
As you can see, the content and structure of the middle region stays fixed for the most part, only adapting slightly to better fit the surrounding pixels. Feels great to tell the model what to do 😎!
Similar to the conditional image generation, we can also perform image inpainting 🎨🖌.
Let’s suppose we generate an image and we like most of it, except for one region that just doesn’t feel quite right. No problem, we can simply mask out the latent codes from the grid that correspond to this unwanted region and let the diffusion model regenerate them. To show how this works, let’s generate some new mouths for a face! By regenerating these codes multiple times, we can get many variations.
Very cool! We get 9 new images with exactly the same hair, eyes and background but with completely different mouths 👄 .
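One practical detail of inpainting is finding which latent codes belong to a region selected in pixel space. Here is a rough sketch, assuming a 256x256 image that maps onto the 16x16 grid so that each code covers a 16x16 pixel patch. The image size, the `MASK` id and both helper names are assumptions, and because of the decoder's receptive field the correspondence is only approximate.

```python
MASK = -1   # hypothetical id for the MASK token

def pixel_box_to_cells(x0, y0, x1, y1, image_size=256, grid=16):
    """Return the latent rows and columns touched by a pixel box (x1, y1 exclusive)."""
    cell = image_size // grid                        # pixels covered by one code
    rows = range(y0 // cell, (y1 - 1) // cell + 1)
    cols = range(x0 // cell, (x1 - 1) // cell + 1)
    return rows, cols

def mask_cells(codes, rows, cols):
    """Copy `codes`, replacing the selected cells with MASK."""
    return [[MASK if r in rows and c in cols else code
             for c, code in enumerate(row)]
            for r, row in enumerate(codes)]

# Toy latent grid standing in for the codes of a generated face.
face_codes = [[r * 16 + c for c in range(16)] for r in range(16)]

rows, cols = pixel_box_to_cells(80, 160, 176, 208)   # rough box around the mouth
masked = mask_cells(face_codes, rows, cols)
```

Handing `masked` to the diffusion model several times would then give several variations of the mouth, with all other codes left untouched.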
Now the real fun can start. Due to the grid structure of the VQGAN’s latent space, the codes learned by the VQGAN are highly spatially correlated with the content of the generated images. This means that the latent codes that correspond to the region of the eyes of a generated face will contain information about the eyes.
Okay, but now what? Let’s take the latent codes of the 👀 of image A and codes of the 👄 of image B. We now paste them in a grid and mask out all the other tokens and have our diffusion model do what it’s best at.
We see that the diffusion model has nicely filled in the masked regions to create a coherent face while staying true to the original look of the mouth and eyes.
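Pasting codes from two images into one grid could look roughly like this. The two source grids are filled with constant toy codes, and the eye and mouth regions are guessed positions on a 16x16 grid, so all names and coordinates here are assumptions.

```python
MASK = -1   # hypothetical id for the MASK token
GRID = 16

def paste(canvas, source, rows, cols):
    """Copy the selected cells of `source` into `canvas` in place."""
    for r in rows:
        for c in cols:
            canvas[r][c] = source[r][c]

# Toy stand-ins for the encoded latent grids of image A and image B.
image_a = [[100] * GRID for _ in range(GRID)]
image_b = [[200] * GRID for _ in range(GRID)]

eye_rows, eye_cols = range(5, 8), range(3, 13)        # assumed eye region
mouth_rows, mouth_cols = range(10, 13), range(5, 11)  # assumed mouth region

canvas = [[MASK] * GRID for _ in range(GRID)]         # start with everything masked
paste(canvas, image_a, eye_rows, eye_cols)            # eyes from image A
paste(canvas, image_b, mouth_rows, mouth_cols)        # mouth from image B
```

The remaining MASK cells are what the diffusion model gets to fill in.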
Let’s get creative and apply this idea to a model trained on churches. The pope wants you to build a new church and particularly loves the base of the famous Notre Dame in Paris and the tower from the magnificent Sint-Baafs cathedral in Ghent. No problem, you can just take out the codes corresponding to the wanted regions and paste them onto an empty latent grid.
All that is left is to ask the diffusion model to fill in the empty regions and decode the result with the VQGAN decoder, and you can easily generate an endless number of new churches that comply with the constraints.
As you can see, the content of the towers and the base stays the same and the model realistically fills in the rest. The pope is very happy with the results and gives you a VIP ticket to skip the line at Saint Peter’s gates. Job well done! If you want some crazier results, you can adjust the sampling temperature of the diffusion model to get some more variation (while trading off some global consistency).
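For readers who haven't met sampling temperature before, here is a small sketch of the usual recipe: divide the logits by the temperature before the softmax. Whether the model applies it exactly like this is an assumption, and the logits below are made up.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                                # subtract the max for stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_code(logits, temperature, rng):
    """Draw one code index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                              # made-up scores for three codes
sharp = softmax_with_temperature(logits, 0.5)         # low temperature: safer picks
flat = softmax_with_temperature(logits, 2.0)          # high temperature: more variety
code = sample_code(logits, 2.0, random.Random(0))
```

With a low temperature most of the probability mass sits on the top code, while a high temperature gives the less likely codes a real chance.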
The last cool application of this model is that it allows us to generate images that are larger than the images the model was trained on. This is accomplished by dividing the latent space of the larger image into multiple overlapping grids that match the original 16x16 shape. At each prediction step we compute the probabilities of new tokens and aggregate them across the different grids.
This trick allows us to generate globally consistent images, even though the model was never trained for it.
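The overlapping-grid idea can be sketched as follows. To keep it readable, the example uses a 4x4 window on a 4x8 latent grid instead of 16x16 windows, a hypothetical `toy_predict` that returns a uniform distribution, and plain averaging of probabilities as the aggregation rule. All of these are assumptions, and the paper may aggregate differently.

```python
def window_starts(size, window, stride):
    """Start offsets of overlapping windows that together cover `size` cells."""
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)                 # make sure the edge is covered
    return starts

def aggregate(height, width, window, stride, predict, vocab):
    """Average per-position code probabilities over all windows covering a cell."""
    sums = [[[0.0] * vocab for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            probs = predict(top, left)               # window x window x vocab
            for r in range(window):
                for c in range(window):
                    counts[top + r][left + c] += 1
                    for k in range(vocab):
                        sums[top + r][left + c][k] += probs[r][c][k]
    averaged = [[[s / counts[r][c] for s in sums[r][c]] for c in range(width)]
                for r in range(height)]
    return averaged, counts

WINDOW, VOCAB = 4, 3

def toy_predict(top, left):
    """Stand-in for the model: a uniform distribution at every window position."""
    return [[[1.0 / VOCAB] * VOCAB for _ in range(WINDOW)] for _ in range(WINDOW)]

probs, counts = aggregate(height=4, width=8, window=WINDOW, stride=2,
                          predict=toy_predict, vocab=VOCAB)
```

Cells in the middle of the wide grid are covered by two windows, so their predictions come from two overlapping contexts, which is what ties the neighbouring grids together.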
In this blogpost we have discussed how the bidirectional and iterative nature of the recently emerging diffusion models, combined with the discrete representations of VQGANs and the long-range modelling capabilities of transformers, allows us to have more control over the latent space. This architecture produces high-quality and consistent images while adding the ability to edit images in a conceptual discrete space.
Keep an eye out, because it’s not the last you’ll see of these diffusion models (in fact, various amazing new papers have been released while writing this post, such as DALL-E 2, Imagen, Stable Diffusion…). And don’t forget to check out the technical blogpost to learn what happens behind the scenes!🤓