Intriguing Properties of Generative Classifiers
Authors: Priyank Jaini, Kevin Clark, Robert Geirhos
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We here investigate perceptual properties of generative classifiers, i.e., models trained to generate images from which we extract zero-shot classification decisions. We focus on two of the most successful types of text-to-image generative models, diffusion models and autoregressive models, and compare them to both discriminative models (e.g., ConvNets, vision transformers, CLIP) and human psychophysical data. Specifically, we focus on the task of visual object recognition (also known as classification) on challenging out-of-distribution datasets and visual illusions. Concretely, in this work, we study the properties of generative classifiers based on three different text-to-image generative models: Stable Diffusion (SD), Imagen, and Parti, on 17 challenging OOD generalization datasets from the model-vs-human toolbox (Geirhos et al., 2021). We compare the performance of these generative classifiers with 52 discriminative models and human psychophysical data. |
| Researcher Affiliation | Industry | Priyank Jaini, Google DeepMind; Kevin Clark, Google DeepMind; Robert Geirhos, Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Classification using diffusion models (a minimal sketch of this procedure is given after the table). |
| Open Source Code | No | The paper includes a code snippet (Listing 1) as an example of diffusion noise augmentation but does not state that the source code for their full methodology is open-sourced or provide a link to a repository for it. |
| Open Datasets | Yes | We study the performance of these generative classifiers on 17 challenging out-of-distribution (OOD) datasets proposed in the model-vs-human toolbox (Geirhos et al., 2021). Of these 17 datasets, five correspond to a non-parametric single manipulation... The other twelve datasets consist of parametric image distortions... We trained a standard ResNet-50 on ImageNet-1K (Russakovsky et al., 2015) by adding diffusion-style noise as a data augmentation during both training and evaluation. |
| Dataset Splits | No | The paper does not provide specific percentages or counts for training, validation, and test splits for the models they evaluate (Imagen, SD, Parti) or for the ResNet-50 they trained with diffusion-noise augmentation. It implies testing is done on the OOD datasets, but no explicit validation split details are given. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used for training or evaluating the models. |
| Software Dependencies | No | The paper mentions 'JAX ImageNet training' and includes 'import jax' and 'import tensorflow as tf' in a code snippet, but it does not specify version numbers for these or any other software components. |
| Experiment Setup | Yes | Preprocessing: We preprocess the 17 datasets in the model-vs-human toolbox by resizing the images to 64×64 resolution for Imagen, 256×256 for Parti, and 512×512 for SD, since these are the base resolutions of the respective models. We use the prompt "A bad photo of a y_k" (where y_k is the class label) for each dataset and every model. We follow the exact experimental setting of Clark & Jaini (2023) for Imagen and Stable Diffusion to obtain classification decisions. Specifically, we use the heuristic weighting function w_t := exp(-7t) in Equation (2) to aggregate scores across multiple time steps. We use a single prompt for each image instead of an ensemble of prompts as used in CLIP to keep the experiments simple. Loss function: We use the L2 loss function for diffusion-based models since it approximates the diffusion variational lower bound (see Equation (2)) and thus results in a Bayesian classifier. Furthermore, both Stable Diffusion and Imagen are trained with the L2 loss objective. We trained a ResNet-50 in exactly the same way as for standard, 90-epoch JAX ImageNet training, with the key difference that we added diffusion noise as a data augmentation (Listing 1 in the paper; a reconstruction is sketched after the table). Since this makes the training task substantially more challenging, we trained the model for 300 instead of 90 epochs. The learning rate was 0.1 with a cosine learning rate schedule, 5 warmup epochs, SGD momentum of 0.9, weight decay of 0.0001, and a per-device batch size of 64. For diffusion-style denoising we used a flag named sqrt_alphas, which ensures that the applied noise does not completely destroy the image information in most cases. The input to the AddNoise method is in the [0, 1] range; the output of the AddNoise method exceeds this bound due to the noise; we did not normalize or clip it afterwards but instead fed it directly into the network. We did not perform ImageNet mean/std normalization. The training augmentations we used were (1) random resized crop, (2) random horizontal flip, (3) add diffusion noise. |
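Below is a minimal sketch of zero-shot classification with a diffusion model (the Algorithm 1 procedure referenced above, following Clark & Jaini, 2023). It is not the authors' code: `eps_model` (a hypothetical text-conditional noise-prediction network), the cosine noise schedule, and the timestep grid are illustrative assumptions; only the weighted L2 scoring and the heuristic weighting w_t = exp(-7t) come from the paper.

```python
import jax
import jax.numpy as jnp

def diffusion_classify(image, class_prompts, eps_model, key, num_steps=30):
    """Pick the class prompt under which the diffusion model best denoises
    the image. `eps_model(x_t, t, prompt)` is a hypothetical stand-in for a
    text-conditional noise-prediction network such as Imagen or SD."""
    ts = jnp.linspace(0.05, 1.0, num_steps)   # diffusion times in (0, 1]
    weights = jnp.exp(-7.0 * ts)              # heuristic w_t = exp(-7t) from the paper
    scores = []
    for prompt in class_prompts:
        err = 0.0
        for t, w in zip(ts, weights):
            key, sub = jax.random.split(key)
            eps = jax.random.normal(sub, image.shape)
            # Variance-preserving forward process: x_t = sqrt(a_t)*x_0 + sqrt(1-a_t)*eps.
            a_t = jnp.cos(0.5 * jnp.pi * t) ** 2   # cosine schedule (an assumption)
            x_t = jnp.sqrt(a_t) * image + jnp.sqrt(1.0 - a_t) * eps
            # Weighted L2 noise-prediction error approximates the variational bound.
            err += w * jnp.mean((eps_model(x_t, t, prompt) - eps) ** 2)
        scores.append(err)
    # Lowest aggregated denoising error gives the Bayes-style classification decision.
    return int(jnp.argmin(jnp.asarray(scores)))
```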
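The paper's Listing 1 (the diffusion-noise augmentation used for ResNet-50 training) is not reproduced in this report. The sketch below reconstructs the described behavior under stated assumptions: the function name `add_diffusion_noise` and the cosine schedule are hypothetical, while the [0, 1] input range, the absence of clipping or normalization afterwards, and the sqrt_alphas-style scaling that usually preserves image content follow the setup described above.

```python
import jax
import jax.numpy as jnp

def add_diffusion_noise(key, image):
    """Diffusion-style noise augmentation for classifier training.
    `image` is expected in [0, 1]; the output may leave that range and,
    per the paper, is fed to the network without clipping or ImageNet
    mean/std normalization."""
    t_key, eps_key = jax.random.split(key)
    t = jax.random.uniform(t_key)                 # random diffusion time in [0, 1)
    eps = jax.random.normal(eps_key, image.shape)
    # sqrt_alphas-style scaling: sqrt(a_t) stays large enough that the image
    # content is not completely destroyed in most cases.
    a_t = jnp.cos(0.5 * jnp.pi * t) ** 2          # cosine schedule (an assumption)
    return jnp.sqrt(a_t) * image + jnp.sqrt(1.0 - a_t) * eps

# Example usage as the final training augmentation (after random resized crop
# and random horizontal flip):
# noisy = add_diffusion_noise(jax.random.PRNGKey(0), image)
```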