Want to know what a unicorn reading the newspaper looks like? Ask DALL·E


DALL·E 2 is a new AI system that can create realistic images and art from a description in natural language.

DALL·E 2 sits at the intersection of deep natural language processing and computer vision, and is described as a hierarchical text-conditional image generation model. The training set is simply pairs of images and their captions, and DALL·E 2's goal is to train two models. The first is the prior, which takes in a text caption and produces a CLIP image embedding. The second is the decoder, which takes a CLIP image embedding (and optionally a text caption) and produces an image. Once trained, the full inference workflow is as follows:

The Text encoding step:

Our caption is transformed into a CLIP text embedding by a neural network trained on 400 million (image, text) pairs.
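To give a feel for how those 400 million pairs are used, here is a toy sketch of CLIP's symmetric contrastive objective: matching caption–image pairs sit on the diagonal of a similarity matrix, and the loss pulls them together while pushing mismatched pairs apart. The tiny 2-D "embeddings" below are illustrative stand-ins, not real CLIP vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric cross-entropy over the text-image similarity matrix.

    Matching pairs lie on the diagonal; the loss rewards each caption's
    embedding for being closest to its own image within the batch.
    """
    n = len(text_embs)
    # Similarity logits, scaled by a temperature as in CLIP.
    logits = [[cosine(t, i) / temperature for i in image_embs] for t in text_embs]

    def row_ce(row, target):
        m = max(row)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in row]
        return -math.log(exps[target] / sum(exps))

    loss_t2i = sum(row_ce(logits[k], k) for k in range(n)) / n
    loss_i2t = sum(row_ce([logits[r][k] for r in range(n)], k) for k in range(n)) / n
    return (loss_t2i + loss_i2t) / 2

# Toy batch: each caption embedding points roughly at its own image embedding.
texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.1, 0.9]]
print(clip_contrastive_loss(texts, images) < clip_contrastive_loss(texts, images[::-1]))  # True
```

Correctly matched pairs score a lower loss than deliberately swapped ones, which is exactly the signal that shapes the shared embedding space.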

The Prior step:

The dimensionality of the CLIP image embeddings is reduced using PCA, which makes the prior cheaper to train and sample

A Transformer with attention is used to generate an image embedding from the text embedding

The Decoder step:

A diffusion model is used to transform the image embedding into a 64x64 image

The image is then fed through two upsampler models (diffusion-based, built on convolutional networks) that upscale it from 64x64 to 256x256, then finally to 1024x1024
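Putting the steps together, the whole inference pipeline can be sketched as a chain of functions. Everything below is a stub: the function names, embedding size, and nearest-neighbour upscaling are illustrative assumptions standing in for the real trained networks, not DALL·E 2's actual API. What the sketch does show faithfully is the data flow and the 64 → 256 → 1024 resolution chain.

```python
import random

EMB_DIM = 8  # illustrative embedding size; CLIP's is much larger

def text_encoder(caption):
    """CLIP text-encoder stub: a deterministic pseudo-embedding per caption."""
    rng = random.Random(caption)
    return [rng.uniform(-1, 1) for _ in range(EMB_DIM)]

def prior(text_emb):
    """Prior stub: maps a text embedding to an image embedding."""
    return [x * 0.5 for x in text_emb]

def decoder(image_emb, size=64):
    """Diffusion-decoder stub: produces a size x size grayscale 'image'."""
    rng = random.Random(sum(image_emb))
    return [[rng.random() for _ in range(size)] for _ in range(size)]

def upsample(image, factor):
    """Nearest-neighbour upscaling standing in for the diffusion upsamplers."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]

def generate(caption):
    emb = text_encoder(caption)   # 1. caption -> CLIP text embedding
    img_emb = prior(emb)          # 2. prior: text embedding -> image embedding
    img64 = decoder(img_emb)      # 3. decoder: image embedding -> 64x64 image
    img256 = upsample(img64, 4)   # 4. upsample to 256x256
    return upsample(img256, 4)    # 5. upsample to 1024x1024

image = generate("a unicorn reading the newspaper")
print(len(image), len(image[0]))  # 1024 1024
```

Swapping each stub for a trained network (and replacing nearest-neighbour upscaling with the diffusion upsamplers) turns this skeleton into the actual system.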

So if we ask DALL·E 2 what a unicorn reading the newspaper looks like, this is the answer:


[Image: DALL·E unicorns]