Diffusion models beat GANs

 
We note that diffusion models, like regular convolutional nets, do not natively produce a single linear feature; instead, they generate a series of feature maps at various points in the network.
 
While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously.
 
Diffusion models - prime candidate. SOTA for image generation, denoising, inpainting, super-resolution, and manipulation.
Trained via a U-Net → iteratively predicts and removes noise → the resulting model can synthesize high-fidelity, diverse, novel images.
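A minimal sketch of what this noise-prediction training step could look like, assuming a PyTorch U-Net `unet(x_t, t)` that predicts the added noise and a precomputed `alphas_cumprod` schedule (names and shapes here are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(unet, x0, alphas_cumprod, optimizer):
    """One noise-prediction step: corrupt x0 at random timesteps, predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)  # random timesteps
    noise = torch.randn_like(x0)                                       # epsilon ~ N(0, I)
    a_bar = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1)           # cumulative alpha_bar_t
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise               # forward (noising) process
    pred = unet(x_t, t)                                                # U-Net predicts the noise
    loss = F.mse_loss(pred, noise)                                     # simplified DDPM objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```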
 
The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps.
 
→ these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification
 
We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks.
We also investigate diffusion features in a transfer learning regime, examining their performance on several downstream datasets.
 
Many computer vision tasks can be broadly classified into two families: 1) discriminative and 2) generative.
Discriminative learning - a model that applies labels to images or parts of images.
Generative learning - a model that generates images.
Unified learning - achieves both.
 
BigBiGAN is much more burdensome to train than other methods; its encoder makes it larger and slower than comparable GANs, and its GAN makes it more expensive than ResNet-based discriminative methods.
Another line of work adapts VAEs → to do better in recognition tasks → by learning mid-level patches.
 
Representations learned by one are not suited for the other.
Generative models naturally need representations that capture low-level pixel and texture details, which are necessary for high-fidelity reconstruction and generation. Discriminative models, on the other hand, primarily rely on high-level information that differentiates objects at a coarse level, based not on individual pixel values but rather on the semantics of the image content.
 
Diffusion models → generative side covered → but discriminative (classification) side → unexplored.
 
One of the main challenges with diffusion models is feature selection. In particular, the selection of noise steps and feature block is not trivial.
 
Additionally, these feature maps can be quite large in terms of both spatial resolution and channel depth, so we suggest various classification heads to take the place of the linear classification layer.
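As a rough illustration of the size issue and the pooling step, here is what a feature map from a single U-Net block at a single noise step might look like before and after pooling (shapes and the pool size are illustrative choices, not the paper's exact settings):

```python
import torch
import torch.nn as nn

# Hypothetical feature map from one U-Net block at one noise step:
# 1024 channels at 16x16 resolution -> 262,144 values per image before pooling.
feat = torch.randn(8, 1024, 16, 16)

pool = nn.AdaptiveAvgPool2d(2)           # pool spatially to 2x2 (the pool size is a tunable choice)
probe = nn.Linear(1024 * 2 * 2, 1000)    # linear probe over the flattened, pooled features

pooled = pool(feat).flatten(1)           # (8, 4096)
logits = probe(pooled)                   # (8, 1000) class scores
```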
 
For downstream tasks, we choose fine-grained visual classification (FGVC), an appealing area in which to use unsupervised features due to the implied scarcity of data for many FGVC datasets.
 
In summary, our contributions are as follows:
  • We demonstrate that diffusion models can be used as unified representation learners, with 26.21 FID (-12.37 vs. BigBiGAN) for unconditional image generation and 61.95% accuracy (+1.15% vs. BigBiGAN) for linear probing on ImageNet.
  • We present analysis and distill principles for extracting useful feature representations from the diffusion process.
  • We compare standard linear probing to specialized MLP, CNN, and attention-based heads for leveraging diffusion representations in a classification paradigm.
  • We analyze the transfer learning properties of diffusion models, with fine-grained visual categorization (FGVC) as a downstream task, on several popular datasets.
  • We use CKA to compare the various representations learned by diffusion models, both in terms of different layers and diffusion properties, as well as to other architectures and pre-training methods.
 
GAN - class of DNN → capable of generating new images in the data distribution → given a random latent vector (z ∈ Z) as input → trained by optimizing a min-max game objective.
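For reference, the min-max objective referred to here is the standard GAN formulation (written in the usual notation, not copied from this paper):

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
$$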
 
GANs can be class-conditioned, where they generate images given noise and class input, or unconditional, where they generate random images from noise alone.
 
Images can be mapped to the GAN latent space → meaning → the GAN learns a representation for the image in noise/latent space. → Some of these approaches directly optimize latent vectors to reconstruct the input image.
 
Others → train encoders → to generate the latent vector → corresponding to a given input image.
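A minimal sketch of the first flavor, direct latent optimization against a frozen generator `G` (the reconstruction loss, learning rate, and `latent_dim` attribute are illustrative assumptions, not any specific method's code):

```python
import torch
import torch.nn.functional as F

def invert_image(G, x, steps=500, lr=0.05):
    """Optimize a latent vector z so that G(z) reconstructs the target image x."""
    z = torch.randn(1, G.latent_dim, requires_grad=True)   # assumes G exposes its latent size
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        recon = G(z)                                        # the generator itself stays frozen
        loss = F.mse_loss(recon, x)                         # pixel-level reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()                                       # z acts as the representation of x
```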
 
Denoising diffusion probabilistic models (DDPMs), aka diffusion models → likelihood-based generative models that learn a denoising Markov chain using variational inference. These models enjoy the benefit of having 1) a likelihood-based objective like VAEs, as well as 2) high visual sample quality like GANs, even on high-variability datasets.
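Concretely, the denoising Markov chain and the simplified noise-prediction objective, in standard DDPM notation (a hedged restatement of the well-known formulation, not the paper's own equations):

$$
p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}\mid x_t), \qquad
p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
$$

$$
L_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert^2\Big]
$$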
 
Discriminative models → learn to represent images.
Early representation learning methods trained neural network backbones on partially degraded inputs and learned image representations by making the model predict the missing information in the actual image. → Many approaches have since emerged that revolve around a contrastive loss objective, pulling positive pairs together and pushing negative pairs apart.
→ Other work shows that good representations can be learned without negative samples.
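The contrastive objective mentioned above is typically of the InfoNCE form, where $\mathrm{sim}$ is a (cosine) similarity, $\tau$ a temperature, and $(z_i, z_j)$ a positive pair (the standard form, not specific to any single cited method):

$$
\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}
$$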
 
Offline clustering and online clustering + multi-view augmentation methods → to get better representations.
 
DINO caron2021emerging uses self-supervised knowledge distillation between various views of an image in Vision Transformers.
 
Unified models - other methods leverage the unsupervised nature of GANs to learn good image representations → one such approach uses reparameterized sampling from the encoder output.
Distinct from GANs, autoencoders are a natural fit for the unified paradigm. ALAE attempts to learn an encoder-generator map to perform both generation and classification.
→ PatchVAE improves on the classification performance of the VAE kingma2022autoencoding by encouraging the model to learn good mid-level patch representations Gupta_2020_CVPR. → MAE DBLP:journals/corr/abs-2111-06377 and iBOT zhou2022ibot train an autoencoder via masked image modeling, and several other transformer-based methods have been built under that paradigm assran2022masked, bao2022beit, huang2022contrastive. → MAGE li2022mage, which uses a variable masking ratio to optimize for both recognition and generation, is the first method to achieve both high-quality unconditional image generation and good classification results.
 
 
In this work, we use the guided diffusion (GD) implementation, which uses a U-Net-style architecture with residual blocks. This implementation improves over the original ho2020denoising architecture by adding multi-head self-attention at multiple resolutions, scale-shift norm, and BigGAN brock2018large residual blocks for upsampling and downsampling. We consider each of these 1) residual blocks, 2) residual+attention blocks, and 3) downsampling/upsampling residual blocks as individual blocks.
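One way per-block feature maps could be collected in practice is with forward hooks, sketched below for a generic PyTorch U-Net (the `blocks` mapping and any module names are illustrative, not guided diffusion's actual attribute names):

```python
import torch

def collect_block_features(unet, x_t, t, blocks):
    """Run one noisy forward pass and record the output of each chosen block."""
    feats, hooks = {}, []

    def make_hook(key):
        def hook(_module, _inputs, output):
            feats[key] = output.detach()        # store this block's feature map
        return hook

    for name, module in blocks.items():         # e.g. {"mid": unet.middle_block, ...} (hypothetical names)
        hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        unet(x_t, t)                            # features are captured via the hooks
    for h in hooks:
        h.remove()                              # clean up so later passes are unaffected
    return feats
```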
 
 
The two most common methods for evaluating the effectiveness of self-supervised pre-training are linear probing and finetuning.
Linear probing, which learns a batch normalization + linear layer on top of frozen features, tests the utility of the learned feature representations – it shows → whether the pre-training learns disentangled representations, and → whether these features carry meaningful semantic correlations.
Finetuning, on the other hand, learns a batch normalization + linear layer but → with no frozen features. In the finetuning regime, we treat → the pre-training method as an expensive weight-initialization method, and retrain the entire architecture for classification.
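A minimal sketch of the two regimes, assuming a `backbone` module that returns a pooled feature vector (the helper and its arguments are illustrative, not the paper's training code):

```python
import torch.nn as nn

def build_classifier(backbone, feat_dim, num_classes, finetune=False):
    """Linear probing freezes the backbone; finetuning leaves it trainable."""
    if not finetune:
        for p in backbone.parameters():
            p.requires_grad = False              # frozen features: only the head is trained
    head = nn.Sequential(
        nn.BatchNorm1d(feat_dim),                # batch normalization + linear layer, as above
        nn.Linear(feat_dim, num_classes),
    )
    return nn.Sequential(backbone, head)
```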
 
In this paper, we focus more on the representative capacity of the frozen features, which is of particular interest in areas like fine-grained classification and few-shot learning, where data may be insufficient for finetuning.
Additionally, this allows us to make statements about the utility of the learned features rather than the learned weights.
We compare representations both internally (between blocks of the U-Net) as well as externally (between different U-Nets and with other self-supervised learning architectures).
 
 
As described previously, we also propose several approaches to deal with the large spatial and channel dimensions of U-Net representations. Naively, we can use a single linear layer with different preliminary pooling, and we show results for various pooling dimensions. Alternatively, we can use a more powerful MLP, CNN, or attention head to address varying aspects of the feature map height, width, and depth. For fairness, we train CNNs, MLPs, and attention heads with comparable parameter counts to our linear layers under the various pooling settings. We show results for such heads, on ImageNet-50, in Figure 1 (right), with full numerical results and model details in the appendix. We note that the attention head performs the best by a fair margin. In Table 4, we try the best-performing attention head on ImageNet (all classes), and find it significantly outperforms the simple linear probe, regardless of pooling. This suggests the classification head is an important mechanism for extracting useful representations from diffusion models, and it could be extended to other generative models.
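One plausible shape for such an attention head, treating the H×W positions of a feature map as tokens attended to by a learnable query (a sketch under assumed shapes, not the paper's exact head):

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Classify a (B, C, H, W) feature map by attending over its spatial positions."""
    def __init__(self, channels, num_classes, dim=256, heads=4):
        super().__init__()
        self.proj = nn.Linear(channels, dim)                   # project channels to a working dim
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [CLS]-style query token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, feat):                                   # feat: (B, C, H, W)
        b = feat.shape[0]
        tokens = self.proj(feat.flatten(2).transpose(1, 2))    # (B, H*W, dim)
        query = self.cls.expand(b, -1, -1)                     # (B, 1, dim)
        out, _ = self.attn(query, tokens, tokens)              # query attends over spatial tokens
        return self.fc(out.squeeze(1))                         # (B, num_classes)

# e.g. AttentionProbe(1024, 1000)(torch.randn(8, 1024, 16, 16)) -> shape (8, 1000)
```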
 
We note that the early layers tend to have higher similarity in all cases, suggesting that diffusion models likely capture similar low-level details in the first few blocks. Also note the impact of the time step: the representations are very dissimilar at later layers when the representations are computed using images from different noise time steps.
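For reference, linear CKA between two sets of flattened representations of the same n inputs can be computed as below (the standard linear-CKA formulation, not code from the paper):

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between representations X (n, d1) and Y (n, d2) of the same n inputs."""
    X = X - X.mean(dim=0, keepdim=True)        # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2               # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()
```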
In this paper, we present an approach for using the representations learned by diffusion models for classification tasks. This re-positions diffusion models as potential state-of-the-art unified self-supervised representation learners. We explain best practices for identifying these representations and provide initial guidance for extracting high-utility discriminative embeddings from the diffusion process. We demonstrate promising transfer learning properties and investigate how different datasets require different approaches to feature extraction. We compare the diffusion representations in terms of CKA, both to show what diffusion models learn at different layers as well as how diffusion representations compare to those from other methods.