Intriguing Properties of Generative Classifiers
Authors: Priyank Jaini, Kevin Clark, Robert Geirhos
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We here investigate perceptual properties of generative classifiers, i.e., models trained to generate images from which we extract zero-shot classification decisions. We focus on two of the most successful types of text-to-image generative models, diffusion models and autoregressive models, and compare them to both discriminative models (e.g., ConvNets, vision transformers, CLIP) and human psychophysical data. Specifically, we focus on the task of visual object recognition (also known as classification) on challenging out-of-distribution datasets and visual illusions. Concretely, in this work, we study the properties of generative classifiers based on three different text-to-image generative models: Stable Diffusion (SD), Imagen, and Parti, on 17 challenging OOD generalization datasets from the model-vs-human toolbox (Geirhos et al., 2021). We compare the performance of these generative classifiers with 52 discriminative models and human psychophysical data. |
| Researcher Affiliation | Industry | Priyank Jaini, Google DeepMind; Kevin Clark, Google DeepMind; Robert Geirhos, Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Classification using diffusion models (a minimal sketch of this procedure is given after the table). |
| Open Source Code | No | The paper includes a code snippet (Listing 1) as an example of diffusion noise augmentation but does not state that the source code for their full methodology is open-sourced or provide a link to a repository for it. |
| Open Datasets | Yes | We study the performance of these generative classifiers on 17 challenging out-of-distribution (OOD) datasets proposed in the model-vs-human toolbox (Geirhos et al., 2021). Of these 17 datasets, five correspond to a non-parametric single manipulation... The other twelve datasets consist of parametric image distortions... We trained a standard ResNet-50 on ImageNet-1K (Russakovsky et al., 2015) by adding diffusion-style noise as a data augmentation during both training and evaluation. |
| Dataset Splits | No | The paper does not provide specific percentages or counts for training, validation, and test splits for the models they evaluate (Imagen, SD, Parti) or for the ResNet-50 they trained with diffusion-noise augmentation. It implies testing is done on the OOD datasets, but no explicit validation split details are given. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used for training or evaluating the models. |
| Software Dependencies | No | The paper mentions 'JAX ImageNet training' and includes 'import jax' and 'import tensorflow as tf' in a code snippet, but it does not specify version numbers for these or any other software components. |
| Experiment Setup | Yes | Preprocessing: We preprocess the 17 datasets in the model-vs-human toolbox by resizing the images to 64×64 resolution for Imagen, 256×256 for Parti, and 512×512 for SD, since these are the base resolutions of the respective models. We use the prompt "A bad photo of a y_k" (where y_k is the class label) for each dataset and every model. We follow the exact experimental setting of Clark & Jaini (2023) for Imagen and Stable Diffusion to obtain classification decisions. Specifically, we use the heuristic weighting function w_t := exp(-7t) in Equation (2) to aggregate scores across multiple time steps. We use a single prompt for each image instead of an ensemble of prompts as used in CLIP to keep the experiments simple. Loss function: We use the L2 loss function for diffusion-based models since it approximates the diffusion variational lower bound (see Equation (2)) and thus results in a Bayesian classifier. Furthermore, both Stable Diffusion and Imagen are trained with the L2 loss objective. We trained a ResNet-50 in exactly the same way as for standard, 90-epoch JAX ImageNet training, with the key difference that we added diffusion noise as a data augmentation (Listing 1 in the paper; a reconstruction is sketched after the table). Since this makes the training task substantially more challenging, we trained the model for 300 instead of 90 epochs. The learning rate was 0.1 with a cosine learning rate schedule, 5 warmup epochs, SGD momentum of 0.9, weight decay of 0.0001, and a per-device batch size of 64. For diffusion-style denoising we used a flag named sqrt_alphas, which ensures that the applied noise does not completely destroy the image information in most cases. The input to the AddNoise method is in the [0, 1] range; the output of the AddNoise method exceeds this bound due to the noise; we did not normalize or clip it afterwards but instead fed it directly into the network. We did not perform ImageNet mean/std normalization. The training augmentations we used were (1) random resized crop, (2) random horizontal flip, (3) add diffusion noise. |
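Below is a minimal sketch of zero-shot classification with a diffusion model (the Algorithm 1 procedure referenced above, following Clark & Jaini, 2023). It is not the authors' code: `eps_model` (a hypothetical text-conditional noise-prediction network), the cosine noise schedule, and the timestep grid are illustrative assumptions; only the weighted L2 scoring and the heuristic weighting w_t = exp(-7t) come from the paper.

```python
import jax
import jax.numpy as jnp

def diffusion_classify(image, class_prompts, eps_model, key, num_steps=30):
    """Pick the class prompt under which the diffusion model best denoises
    the image. `eps_model(x_t, t, prompt)` is a hypothetical stand-in for a
    text-conditional noise-prediction network such as Imagen or SD."""
    ts = jnp.linspace(0.05, 1.0, num_steps)   # diffusion times in (0, 1]
    weights = jnp.exp(-7.0 * ts)              # heuristic w_t = exp(-7t) from the paper
    scores = []
    for prompt in class_prompts:
        err = 0.0
        for t, w in zip(ts, weights):
            key, sub = jax.random.split(key)
            eps = jax.random.normal(sub, image.shape)
            # Variance-preserving forward process: x_t = sqrt(a_t)*x_0 + sqrt(1-a_t)*eps.
            a_t = jnp.cos(0.5 * jnp.pi * t) ** 2   # cosine schedule (an assumption)
            x_t = jnp.sqrt(a_t) * image + jnp.sqrt(1.0 - a_t) * eps
            # Weighted L2 noise-prediction error approximates the variational bound.
            err += w * jnp.mean((eps_model(x_t, t, prompt) - eps) ** 2)
        scores.append(err)
    # Lowest aggregated denoising error gives the Bayes-style classification decision.
    return int(jnp.argmin(jnp.asarray(scores)))
```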
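The paper's Listing 1 (the diffusion-noise augmentation used for ResNet-50 training) is not reproduced in this report. The sketch below reconstructs the described behavior under stated assumptions: the function name `add_diffusion_noise` and the cosine schedule are hypothetical, while the [0, 1] input range, the absence of clipping or normalization afterwards, and the sqrt_alphas-style scaling that usually preserves image content follow the setup described above.

```python
import jax
import jax.numpy as jnp

def add_diffusion_noise(key, image):
    """Diffusion-style noise augmentation for classifier training.
    `image` is expected in [0, 1]; the output may leave that range and,
    per the paper, is fed to the network without clipping or ImageNet
    mean/std normalization."""
    t_key, eps_key = jax.random.split(key)
    t = jax.random.uniform(t_key)                 # random diffusion time in [0, 1)
    eps = jax.random.normal(eps_key, image.shape)
    # sqrt_alphas-style scaling: sqrt(a_t) stays large enough that the image
    # content is not completely destroyed in most cases.
    a_t = jnp.cos(0.5 * jnp.pi * t) ** 2          # cosine schedule (an assumption)
    return jnp.sqrt(a_t) * image + jnp.sqrt(1.0 - a_t) * eps

# Example usage as the final training augmentation (after random resized crop
# and random horizontal flip):
# noisy = add_diffusion_noise(jax.random.PRNGKey(0), image)
```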