The only GPT-like parts in DALL-E 2 are the CLIP encoders used to condition the DDPM.
CLIP is a different AI, also trained by OpenAI, whose goal is to predict a correlation score between images and their text descriptions.
From CLIP we get 2 models:
→ 1) image encoder: input → image, output → embedding
→ 2) text encoder: input → text, output → embedding
Based on these 2 embeddings, we calculate how similar they are.
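To make that concrete, here's a minimal sketch of the idea, assuming toy stand-in encoders (in real CLIP they are a vision transformer and a text transformer; the 512-dim embedding size matches the ViT-B/32 variant): both inputs land in the same embedding space, and cosine similarity scores how well an image matches a caption.

```python
# Minimal sketch, NOT the real CLIP code: the Linear / EmbeddingBag "encoders"
# below are placeholders for CLIP's actual vision and text transformers.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512  # CLIP ViT-B/32 embeddings are 512-dimensional

image_encoder = nn.Linear(3 * 32 * 32, EMB_DIM)     # toy "image encoder": flattened 32x32 RGB -> embedding
text_encoder = nn.EmbeddingBag(1000, EMB_DIM)       # toy "text encoder": token ids -> embedding

def clip_similarity(images, token_ids):
    img_emb = F.normalize(image_encoder(images.flatten(1)), dim=-1)   # unit-length image embeddings
    txt_emb = F.normalize(text_encoder(token_ids), dim=-1)            # unit-length text embeddings
    return img_emb @ txt_emb.T                                        # cosine similarities

images = torch.randn(4, 3, 32, 32)          # batch of 4 toy "images"
captions = torch.randint(0, 1000, (4, 8))   # 4 toy "captions", 8 token ids each
sims = clip_similarity(images, captions)    # 4x4 similarity matrix
# CLIP's contrastive training pushes the diagonal (matching pairs) up
# and the off-diagonal (mismatched pairs) down.
print(sims.shape)  # torch.Size([4, 4])
```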
How diffusion models work →
- say you take an image of a face
- add a little bit of random noise to it, like the static you see on a TV
- with just a little bit of noise added, you can still clearly see the original image of a face. More than that, you could probably reconstruct the original image pretty easily, and an AI could probably learn to do that pretty easily too.
- but, if you incrementally add a little bit of noise to an image, eventually you will get something that looks like pure noise. With no trace of the original image.
- so the idea behind DDPMs is the following: take images, and add small amounts of noise to them, say, 1,000 times. Use this as training data to teach an AI to reverse one noising step. So the AI only learns to take a somewhat noisy image and slightly denoise it, simple enough right? But it turns out that if you generate a picture of 100% noise, that isn't even based on anything real, and then apply this denoising AI a bunch of times, eventually you'll end up with an image that looks real! It really is amazing, and there is a mathematical explanation for that, but I won't get into it.
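Here's a rough sketch of both halves of that idea. Everything here is a placeholder: `denoiser` stands in for the trained network (a big U-Net in practice), and the linear `alpha` is a deliberately crude stand-in for the real DDPM noise schedule.

```python
# Rough sketch of the DDPM idea, not the actual math from the paper.
import torch
import torch.nn as nn

T = 1000                   # number of noising steps
denoiser = nn.Identity()   # placeholder for a trained network that removes a little noise

def make_training_pair(image, t):
    # Forward process: mix the image with noise; larger t => noisier image.
    noise = torch.randn_like(image)
    alpha = 1.0 - t / T                               # crude stand-in for the real noise schedule
    noisy = alpha**0.5 * image + (1 - alpha)**0.5 * noise
    return noisy, noise                               # the network is trained to undo this one step

def sample(shape, steps=T):
    # Reverse process: start from pure noise and apply the denoiser over and over.
    x = torch.randn(shape)                            # "a picture of 100% noise"
    for t in reversed(range(steps)):
        x = denoiser(x)                               # one small denoising step (placeholder)
    return x                                          # after many steps: something that looks real

face = torch.randn(1, 3, 64, 64)                      # toy "face" image
noisy, target_noise = make_training_pair(face, t=500)
generated = sample((1, 3, 64, 64))
```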
So how does the CLIP model from before tie into this one?
Well, because we don't want to just generate random images, we want the images to match a text description. So what if we supply the diffusion model with additional info about what we're denoising? Hopefully it should learn to use this info to guide the denoising process. And it does! The researchers at OpenAI tried a few different things.
- They tried feeding text directly into the diffusion model, with a GPT-like architecture.
- Remember how we said that CLIP learns to summarize text into 512 numbers such that it can easily be matched with images? Yup, we can use the CLIP text embedding of the image caption as additional info for the DDPM to use when denoising, so that it hopefully learns to use this info to guide the denoising process. And that yielded way superior results to the previous approach.
- However, OpenAI researchers found an even better way to do it! Remember how we said that CLIP takes a text caption and turns it into an embedding, and takes an image and turns it into an embedding, such that the embeddings of the image and the text can easily be detected as correlated with each other?
So we can take a bunch of existing image and text description pairs, put both of them through CLIP, and get a bunch of matching text and image embeddings. What OpenAI then did was train a NEW model to predict what the matching IMAGE embedding will look like given a TEXT embedding; they call this model the prior. And then THAT embedding is given to the DDPM as additional info for denoising. And that yielded the best results.
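A hedged sketch of the prior idea, with a small placeholder MLP standing in for the real prior (which in the paper is itself a transformer / diffusion model) and random tensors standing in for the CLIP embeddings of real (caption, image) pairs:

```python
# Toy sketch: train a "prior" to predict the matching CLIP image embedding
# from a CLIP text embedding. All sizes and the MLP itself are placeholders.
import torch
import torch.nn as nn

EMB_DIM = 512
prior = nn.Sequential(nn.Linear(EMB_DIM, 1024), nn.GELU(), nn.Linear(1024, EMB_DIM))
opt = torch.optim.Adam(prior.parameters(), lr=1e-4)

# Pretend these came from running (caption, image) pairs through CLIP's two encoders.
text_embs = torch.randn(256, EMB_DIM)
image_embs = torch.randn(256, EMB_DIM)

for step in range(100):                               # toy training loop
    pred = prior(text_embs)                           # predicted matching image embedding
    loss = nn.functional.mse_loss(pred, image_embs)   # pull the prediction toward the true image embedding
    opt.zero_grad()
    loss.backward()
    opt.step()
```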
So to recap:
text caption --- CLIP-based prior ---> predicted matching image embedding
result = diffusion(pure random noise, predicted matching image embedding)
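Or, the same recap as toy Python. Every name here (`clip_text_encoder`, `prior`, `denoise_step`) is an illustrative stand-in, not the real API:

```python
# The recap above as pseudocode-ish Python; all components are placeholders.
import torch
import torch.nn as nn

EMB_DIM = 512
clip_text_encoder = nn.EmbeddingBag(1000, EMB_DIM)   # placeholder: caption tokens -> CLIP text embedding
prior = nn.Linear(EMB_DIM, EMB_DIM)                  # placeholder: text embedding -> predicted image embedding

def denoise_step(x, cond):
    # Placeholder for one step of the conditional DDPM: in the real model a
    # U-Net uses `cond` (the predicted image embedding) to guide the denoising.
    return x

def generate(caption_tokens, steps=1000):
    text_emb = clip_text_encoder(caption_tokens)     # text caption -> CLIP text embedding
    image_emb = prior(text_emb)                      # --- prior ---> predicted matching image embedding
    x = torch.randn(1, 3, 64, 64)                    # start from pure random noise
    for _ in range(steps):
        x = denoise_step(x, image_emb)               # result = diffusion(noise, predicted image embedding)
    return x

img = generate(torch.randint(0, 1000, (1, 8)))
```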
Of course there's more to this, like upsampling and the diffusion of the prior itself, but I just tried to summarize the gist of it.
So the "cryptic language" is built by the AI to fit the task you give it. With auto-encoder systems the embedding is meant to describe the image such so that it would be reconstructed from the embedding. But with
CLIP the task if to fit enough information in the embedding such so that it could be matched with the appropriate description
. This yields an entirely different learned language, where the model is encouraged to focus less on the exact visual details and instead describe the semantic content of the image in the embeddingThe approach to images here is very different from Image GPT. (Though this is not the first time OpenAI has written about this approach -- see the "Image VQ" results from the multi-modal scaling paper.)
In Image GPT, an image is represented as a 1D sequence of pixel colors. The pixel colors are quantized to a palette of size 512, but still represent "raw colors" as opposed to anything more abstract. Each token in the sequence represents 1 pixel.
In DALL-E, an image is represented as a 2D array of tokens from a latent code. There are 8192 possible tokens. Each token in the sequence represents "what's going on" in a roughly 8x8 pixel region (because they use 32x32 codes for 256x256 images).
(Caveat: The mappings from pixels-->tokens and tokens-->pixels are contextual, so a token can influence pixels outside "its" 8x8 region.)
This latent code is analogous to the BPE code used to represent tokens (generally words) for text GPT. Like BPE, the code is defined before doing generative training, and is presumably fixed during generative training. Like BPE, it chunks the "raw" signal (pixels here, characters in BPE) into larger, more meaningful units.
This is like a vocabulary of 8192 "image words." DALL-E "writes" a 32x32 array of these image words, and then a separate network "decodes" this discrete array to a 256x256 array of pixel colors.
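As a toy illustration of the two shapes involved (random palette/codebook and nearest-neighbor matching here, which is far dumber than Image GPT's clustered palette or DALL-E's learned, contextual dVAE, but the token counts and vocab sizes are the real ones):

```python
# Toy contrast of the two representations; the "palette" and "codebook" are random.
import torch

image = torch.rand(256, 256, 3)                                # one 256x256 RGB image

# Image GPT-style: each pixel becomes one token from a 512-color palette.
palette = torch.rand(512, 3)
pixel_tokens = torch.cdist(image.reshape(-1, 3), palette).argmin(dim=-1)
print(pixel_tokens.shape)                                      # 65,536 tokens, vocab size 512

# DALL-E-style: each 8x8 region becomes one token from an 8,192-entry codebook.
patches = image.reshape(32, 8, 32, 8, 3).permute(0, 2, 1, 3, 4).reshape(1024, -1)
codebook = torch.rand(8192, 8 * 8 * 3)
latent_tokens = torch.cdist(patches, codebook).argmin(dim=-1).reshape(32, 32)
print(latent_tokens.shape)                                     # a 32x32 grid of "image words", vocab size 8,192
```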
Intuitively, this feels closer than Image GPT to mimicking what text GPT does with text. Pixels are way lower-level than words; 8x8 regions with contextual information feel closer to the level of words.
As with BPE, you get a head start over modeling the raw signal. As with BPE, the chunking may ultimately be a limiting factor. Although the chunking process here is differentiable (a neural auto-encoder), so it ought to be adaptable in a way BPE is not.
(Trivia: I'm amused that one of their visuals allows you to ask for images of triangular light bulbs -- the example Yudkowsky used in LOGI to illustrate the internal complexity of superficially atomic concepts.)