September 7, 2021

Multimodal AI: overview + experiments with DALL-E & CLIP

Juta Staes
Machine Learning Engineer | Squad Lead

In recent years, AI has made significant progress in task-specific applications such as text completion and image classification. However, progress on more general understanding across domains was lacking until recently. With the arrival of multimodal models such as DALL-E and CLIP, it has become possible to interpret and generate text, images and combinations of both within one model. In this blogpost I will dive into the world of multimodal AI: we’ll see what it is and look at some of its history. Then we’ll take a closer look at DALL-E & CLIP, and at some experiments that we did with these models.


Multimodal AI


First let’s start by explaining what a modality is. As humans we can understand the world around us: we can see objects, hear sounds and read books. Each of these signals can be considered a different modality: images, audio and text. Different modalities are characterized by different statistical properties. For example, images are usually represented by storing pixel values in a 2D map, while text can be represented as a sequence of numerical vectors, where each vector represents one word.

Examples of modalities and their data characteristics.
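As a minimal illustration of the two representations described above, the sketch below holds an image as a 2D grid of pixel values and a sentence as a sequence of vectors, one per word. The image size, the tiny vocabulary and the 3-dimensional vectors are made up for the example; real models learn their word vectors from data.

```python
import random

random.seed(0)

# Image modality: a 2D map of pixel values (here a 4x4 grayscale image, values 0-255).
height, width = 4, 4
image = [[random.randint(0, 255) for _ in range(width)] for _ in range(height)]

# Text modality: a sequence of numerical vectors, one vector per word.
# The vocabulary and the vector size are hypothetical placeholders.
embedding_dim = 3
vocabulary = ["a", "photo", "of", "dog"]
word_vectors = {
    word: [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for word in vocabulary
}

sentence = "a photo of a dog".split()
text = [word_vectors[word] for word in sentence]

print(f"image: {len(image)} rows x {len(image[0])} columns of pixel values")
print(f"text: {len(text)} words x {len(text[0])} values per word")
```

The two structures have different shapes and different statistics, which is why a multimodal model needs a way to bring them into a common representation.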

In order for AI to understand the world around us, it needs to be able to interpret these different modalities and to reason about them at the same time. Multimodal AI does exactly that. It uses different modalities as input/output and combines them in the same model. Some example use cases include: generating a caption for an image, visual question answering or performing emotion detection based on both audio and text. In this blogpost we will look specifically at models that combine text and images. But first let’s take a closer look at the history of multimodal AI across all modalities.


One of the earliest examples of multimodal research is audio-visual speech recognition (AVSR) by Yuhas et al. (1989). It was motivated by the McGurk effect, a phenomenon that demonstrates an interaction between hearing and vision in speech perception: a sound can be perceived differently when it is paired with a non-matching visual component. Currently AVSR is used mostly when dealing with noisy audio signals, because the visual information has proven to be redundant in combination with high-quality audio signals.

A second important category of multimodal applications comes from the field of multimedia content indexing and retrieval. This problem was first tackled with keyword-based approaches, but in the late 1990s new techniques were used to query the content directly. Multimodal models were developed that combined video and audio as input to support content-based video indexing. Snoek et al. (2003) give an overview of the state of the art at the time.

A third category of multimodal research was established in the early 2000s around the field of emotion recognition with the goal of understanding human behaviors during social interactions. In 2011 the first audio-visual emotion challenge (AVEC) was organized. Starting from 2012 a lot of progress was made thanks to the strong technical advances in image classification, and later in face detection. A summary of recent progress in multimodal affect recognition was published by D’Mello et al. (2015).

Most recently, a new category of multimodal applications emerged with an emphasis on language and vision: media description and generation. One of the most representative applications is image captioning, where the task is to generate a text description of the input image. In 2017 Vaswani et al. introduced transformers as a new model architecture for NLP applications. Soon after that, transformers were used for vision applications as well. In 2021, OpenAI released two multimodal models (DALL-E & CLIP) that combine vision and textual data and use the transformer architecture. Both networks have a large number of parameters and are trained on huge datasets, and they show very impressive results. Given this recent evolution, I think we are only at the start of what multimodal AI has to offer, and it is bound to amaze us even more in the future.

Architecture of a transformer using the attention mechanism, source:

Enough about what multimodal AI is and why it is here, let’s now take a look at how it works. In the rest of this blogpost we will take a closer look at these two models: DALL-E & CLIP. We will briefly describe each model, its capabilities and how it was trained. In addition, we’ll present some experiments that we did with both models: we trained our own DALL-E model from scratch, and we tested a pretrained CLIP model out of the box.


Starting with GPT-2, the trend was set towards transformer networks with billions of parameters. DALL-E is a generative network with 12 billion parameters that creates images based on textual input. It can generate images from scratch based on a description, but it can also regenerate specific rectangular regions of an existing image in a way that is consistent with the text prompt. To scale up to 12 billion parameters, OpenAI created a dataset of 250 million text-image pairs to train the network.

In order to understand how the network works, I recommend watching Yannic Kilcher’s video where he explains how DALL-E works, or reading the paper. Here I will only explain some of the basics. First you need to know that DALL-E uses a pretrained variational autoencoder (VAE) that maps each image to a 32x32 grid of tokens, where each token comes from a fixed vocabulary of size 8192. This VAE, which can reconstruct the image from the 32x32 grid, allows the network to reason about the image in a relatively small space. Second, DALL-E combines up to 256 BPE-encoded text tokens with these 32x32 (=1024) image tokens and models them autoregressively as a single stream of data. This means the transformer is trained to predict the next token given the input text and the tokens generated so far. The transformer is a decoder-only model in which each image token can attend to all text tokens in any one of its attention layers, while attention between image tokens is sparse.

Overview of how DALL-E is trained and used to generate images.
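To make the "single stream of data" idea more concrete, here is a small sketch under stated assumptions. The sequence lengths and the image vocabulary size come from the description above; the text vocabulary size, the function names and the random token ids are assumptions made for illustration only. The sketch also ignores the sparse attention pattern and only shows how the tokens are laid out and which token is predicted from which context.

```python
import random

random.seed(0)

TEXT_LEN = 256            # BPE-encoded text tokens (from the description above)
GRID = 32                 # the VAE maps an image to a 32x32 grid of tokens
IMAGE_LEN = GRID * GRID   # 1024 image tokens
IMAGE_VOCAB = 8192        # size of the VAE's fixed vocabulary
TEXT_VOCAB = 16384        # assumption for this sketch, not stated in this post


def build_stream(text_tokens, image_tokens):
    """Concatenate text and image tokens into one sequence.

    Image token ids are shifted by TEXT_VOCAB so that both modalities
    share a single id space without collisions.
    """
    if len(text_tokens) != TEXT_LEN or len(image_tokens) != IMAGE_LEN:
        raise ValueError("unexpected sequence length")
    return list(text_tokens) + [TEXT_VOCAB + t for t in image_tokens]


def training_targets(stream):
    """Yield (context_length, target): token i is predicted from stream[:i]."""
    for i in range(1, len(stream)):
        yield i, stream[i]


# Random ids stand in for a real caption and a real VAE-encoded image.
text_tokens = [random.randrange(TEXT_VOCAB) for _ in range(TEXT_LEN)]
image_tokens = [random.randrange(IMAGE_VOCAB) for _ in range(IMAGE_LEN)]

stream = build_stream(text_tokens, image_tokens)
targets = list(training_targets(stream))

print(len(stream))  # 256 + 1024 = 1280
```

At generation time the same layout is used: the text tokens are given, and the 1024 image tokens are sampled one by one before the VAE decodes them back into pixels.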

As a result, DALL-E is able to generate very impressive images. Its capabilities include combining unrelated concepts in plausible ways (a), creating anthropomorphized versions of animals and objects (b), rendering text (c), and applying transformations to existing images (d). Examples of these capabilities can be seen in the image below.

Examples of DALL-E results, source:

Experiments with DALL-E

At ML6 we try to stay at the forefront of innovation. The release of DALL-E and code to run it presented us with a nice opportunity to experiment with it. The code we used for this experiment can be found here: Note that while the code was released, the pre-trained model is not available, so we needed to train the model from scratch. We trained DALL-E on a dataset that we created from Material Design Icons, containing 6000 images with a description. We added each icon 8 times, once for each of 8 different colors, so we ended up with a dataset of 48,000 images. The image resolution was 128x128.

In order to make it work on a smaller dataset, and also because of limited compute resources, the trained networks had to be a lot smaller. One technique is to decrease the input size, either by using images with a smaller resolution or by limiting the number of text tokens. Another technique is to make the networks less deep. First we trained the VAE with a vocabulary size of 512. DALL-E was then trained with 32 input tokens, and the size of the network was limited.

The results are quite interesting to see. The quality is far from the results presented by OpenAI, which is not surprising given that we used such a small dataset. But we can see that our network learned some concepts and also learned to combine them. To make it work better, more time could be spent on optimizing the network size for small images and on training longer.

Results from custom trained DALL-E.


What we haven’t mentioned yet about DALL-E is that it actually uses an additional model, called CLIP, to rank the resulting images. DALL-E can for example generate up to 512 images, of which only the best 32 are kept based on the ranking obtained from CLIP.
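The reranking step can be sketched as follows, assuming that the ranking model provides one embedding for the text prompt and one for each candidate image, and that candidates are scored by cosine similarity. The function names, the embedding size and the random vectors are placeholders for illustration, not the actual DALL-E pipeline.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, image_embeddings, keep=32):
    """Return the indices of the `keep` best-matching candidates, best first."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]


random.seed(0)
dim = 16  # hypothetical embedding size, chosen small for the example

# Random vectors stand in for the embeddings of one prompt and 512 generated images.
text_embedding = [random.gauss(0, 1) for _ in range(dim)]
candidates = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(512)]

best = rerank(text_embedding, candidates, keep=32)
print(f"kept {len(best)} of {len(candidates)} candidates")
```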

CLIP is a set of models that generalize to different vision tasks without being explicitly trained for these tasks. State-of-the-art models often perform well on the task they are trained for, but don’t generalize to different tasks. CLIP aims to resolve this problem. Its authors propose a pre-training task where, given an image, the model needs to predict which out of a set of 32,768 randomly sampled text snippets was actually paired with it in their dataset. And as with DALL-E, they use a huge dataset, consisting of 400 million (image, text) pairs, to train the model.
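This pre-training task can be written as a contrastive objective over a batch of (image, text) pairs. The sketch below is a plain-Python illustration under assumptions: it uses a symmetric cross-entropy over cosine similarities, in the spirit of the pseudocode in the CLIP paper, with a fixed temperature of 0.07 chosen for the example. The tiny hand-made embeddings replace the real image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Pair i (image i, text i) is the correct match; every other text in the
    batch acts as a negative for image i, and vice versa.
    """
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Hypothetical 3-dimensional embeddings for a batch of three pairs.
image_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_batch = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_text_batch = text_batch[1:] + text_batch[:1]

matched = contrastive_loss(image_batch, text_batch)
mismatched = contrastive_loss(image_batch, shuffled_text_batch)
print(matched < mismatched)  # correctly paired batches get the lower loss
```

Minimizing this loss pulls the embeddings of matching images and texts together and pushes non-matching ones apart, which is what later makes similarity search between the two modalities possible.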

In order to solve this task, the intuition is that CLIP models need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can be applied, out of the box, to a lot of visual classification tasks. An example of this is shown in the picture below. The training data for CLIP might consist of several pictures of dogs paired with a textual description. If we then want to use CLIP to classify the type of animal, we can do so by inserting each candidate animal into a description such as “a photo of a dog”, and then checking which description the image is most likely to be paired with.


This behavior is called zero-shot learning. It refers to recognizing a concept without having explicitly seen any labelled examples of it. The model never saw images explicitly labelled with the class “dog”, yet it is capable of performing the classification task. This makes it possible to use the model out of the box without the need for a labelled dataset, which is often expensive to create.
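The zero-shot classification procedure described above can be sketched as follows. This is a toy illustration, not the CLIP API: `toy_embed_text` and the hand-made 3-dimensional embeddings are hypothetical stand-ins for the real text and image encoders, and the scale factor of 100 applied before the softmax is an assumption for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, labels, embed_text,
                       template="a photo of a {}", scale=100.0):
    """Turn each label into a prompt and return a probability per label."""
    prompts = [template.format(label) for label in labels]
    logits = [scale * cosine_similarity(image_embedding, embed_text(p))
              for p in prompts]
    # Softmax over the prompts (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Hypothetical text embeddings; a real setup would call the text encoder here.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a bird": [0.0, 0.1, 0.9],
}


def toy_embed_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]


# Hypothetical embedding of a dog picture, as an image encoder might produce it.
dog_image_embedding = [0.8, 0.2, 0.1]

probabilities = zero_shot_classify(
    dog_image_embedding, ["dog", "cat", "bird"], toy_embed_text
)
prediction = max(probabilities, key=probabilities.get)
print(prediction)
```

Changing the classification task only means changing the list of labels or the prompt template; no retraining is involved.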

Experiments with CLIP

While the DALL-E model was not released, for CLIP the source code and the trained model were made publicly available. So again this presented an opportunity to try it out ourselves. We made an application with code based on the following repository:

Everyone working at ML6 knows how much memes are part of our culture, and we have a good laugh sharing and making memes every week. So we present to you: the meme finder, an application made by my colleague Thomas Dehaene that can find meme templates based on a textual description. We embedded all the meme images with CLIP beforehand, and when someone enters a search query, we embed the text and return the most similar meme images. You can try it out yourself at
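The search flow described above (embed all images once, then compare each query against the stored embeddings) could look roughly like the sketch below. It is not the code of the actual application: the `MemeIndex` class, the template names and the 3-dimensional vectors are hypothetical, and in a real setup the embeddings would come from CLIP’s image and text encoders.

```python
import heapq
import math


def _unit(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


class MemeIndex:
    """Stores normalized image embeddings that are computed once, up front."""

    def __init__(self):
        self._names = []
        self._vectors = []

    def add(self, name, image_embedding):
        self._names.append(name)
        self._vectors.append(_unit(image_embedding))

    def search(self, query_embedding, top_k=3):
        """Return the names of the top_k most similar images, best first."""
        query = _unit(query_embedding)
        scored = (
            (sum(a * b for a, b in zip(query, vec)), name)
            for name, vec in zip(self._names, self._vectors)
        )
        return [name for _, name in heapq.nlargest(top_k, scored)]


# Offline step: embed every meme template once (placeholder vectors here).
index = MemeIndex()
index.add("template_a", [1.0, 0.0, 0.0])
index.add("template_b", [0.0, 1.0, 0.0])
index.add("template_c", [0.7, 0.7, 0.0])

# Online step: embed the search text (placeholder vector) and look it up.
query_embedding = [0.9, 0.2, 0.0]
results = index.search(query_embedding, top_k=2)
print(results)
```

Because the image embeddings are computed ahead of time, each search only needs one text embedding and a pass of dot products, which keeps queries fast even without a GPU.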


We took a look at what multimodal AI is, what kinds of applications it has and what the most recent advancements are. These latest developments in particular make me very excited and curious about what is still to come. We clearly see how the combination of huge compute and huge datasets, paired with the right model, can lead to very capable models. Too bad not every ML Engineer has 250 GPUs lying around and a labelled dataset of over 100 million images.

DALL-E & CLIP are both very impressive models, and we can learn a lot from them. Both models clearly demonstrate how AI is able to learn across different data types. Images and text have very different properties, and yet we are able to model them within the same space. This is something we could not imagine 10 years ago, when we were barely able to classify images. Imagine what AI will be able to do in 10 more years.

