Text-to-Image Diffusion Models are Zero-Shot Classifiers

Authors: Kevin Clark, Priyank Jaini

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment: for each variable, the result and the supporting LLM response.

Research Type: Experimental
LLM response: "We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. ... In this paper, we investigate these questions by transferring Imagen and Stable Diffusion (SD) to discriminative tasks."

Researcher Affiliation: Industry
LLM response: "Kevin Clark, Google DeepMind, Toronto, kevclark@google.com; Priyank Jaini, Google DeepMind, Toronto, pjaini@google.com"

Pseudocode: Yes
LLM response: "Algorithm 1: Diffusion model classification with pruning."

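For orientation, here is a minimal Python sketch of what a procedure like the paper's Algorithm 1 could look like: each candidate class prompt is scored by the diffusion model's denoising error, and once at least min_scores scores have accumulated, classes that score significantly worse than the current best are pruned. The default hyperparameters match those quoted in the Experiment Setup entry below. The `denoising_error` callable is a hypothetical stand-in for one forward pass of the diffusion model, and the one-sided paired t-test pruning rule is an assumption about how cutoff_pval is applied; the paper's exact scoring and statistical test may differ.

```python
# Sketch of diffusion-model zero-shot classification with pruning,
# in the spirit of the paper's Algorithm 1. Not the authors' code.
import numpy as np
from scipy import stats

def classify(image, class_prompts, denoising_error,
             min_scores=20, max_scores=2000, cutoff_pval=2e-3):
    candidates = list(range(len(class_prompts)))
    scores = {c: [] for c in candidates}  # per-class denoising errors

    for step in range(max_scores):
        # Share the same (t, eps) across all remaining classes so that
        # per-step score differences are paired and lower-variance.
        t = np.random.uniform(0.0, 1.0)
        eps = np.random.randn(*image.shape)
        for c in candidates:
            # Hypothetical hook: ||eps_hat(x_t, prompt, t) - eps||^2.
            scores[c].append(denoising_error(image, class_prompts[c], t, eps))

        if step + 1 >= min_scores and len(candidates) > 1:
            best = min(candidates, key=lambda c: np.mean(scores[c]))
            kept = [best]
            for c in candidates:
                if c == best:
                    continue
                # Prune class c if its errors are significantly larger than
                # the current best's under a one-sided paired t-test.
                _, pval = stats.ttest_rel(scores[c], scores[best],
                                          alternative='greater')
                if pval >= cutoff_pval:
                    kept.append(c)
            candidates = kept
        if len(candidates) == 1:
            break

    # Predict the remaining class with the lowest mean denoising error.
    return min(candidates, key=lambda c: np.mean(scores[c]))
```

Sharing the noise sample and timestep across classes is a natural variance-reduction choice for paired comparisons, but it is an assumption here rather than something stated in this summary.
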
Open Source Code: No
LLM response: "The paper does not provide an explicit statement about releasing the source code for the described methodology, nor does it include a link to a code repository."

Open Datasets: Yes
LLM response: "We consider 13 datasets from Radford et al. (2021) as reported in Table 1. ... We compare Imagen and SD with CLIP, the recently proposed ViT-22B model (Dehghani et al., 2023), which was trained on JFT (Sun et al., 2017) extended to 4B images (Zhai et al., 2022) and fine-tuned on Imagenet, and a (not zero-shot) supervised ResNet50 model trained on the training set. ... We use synthetic images similar to Lewis et al. (2022), where images are generated based on the CLEVR (Johnson et al., 2017) visual question answering dataset."

Dataset Splits: No
LLM response: "The paper mentions using 'reduced-size datasets (4096 examples)' for experiments and discusses 'train' and 'test' scenarios implicitly. However, it does not specify explicit validation splits (e.g., percentages or sample counts for validation sets) or describe a cross-validation setup for reproducibility."

Hardware Specification: No
LLM response: "The paper does not specify the hardware (e.g., GPU models, CPU types, or cloud computing instances with specifications) used to run the experiments."

Software Dependencies: No
LLM response: "The paper mentions using specific models like 'Stable Diffusion v1.4' and 'frozen T5... language encoder' or 'pre-trained text encoder from CLIP', but it does not list specific software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, CUDA 11.x) for replication."

Experiment Setup: Yes
LLM response: "We preprocess each dataset by performing a central crop and then resizing the images to 64x64 resolution for Imagen, 512x512 for SD, and 224x224 for CLIP. We use min_scores = 20, max_scores = 2000, and cutoff_pval = 2e-3. We use a single prompt for each image."

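For concreteness, a minimal sketch of the stated preprocessing using torchvision. Cropping to the shorter image side and the default interpolation are assumptions; the source only states a central crop followed by the model-specific resize.

```python
# Sketch of the described preprocessing: a square central crop followed by
# a model-specific resize (64x64 for Imagen, 512x512 for Stable Diffusion,
# 224x224 for CLIP). Crop size and interpolation mode are assumptions.
from torchvision import transforms
from torchvision.transforms import functional as F

RESOLUTIONS = {"imagen": 64, "stable_diffusion": 512, "clip": 224}

def build_preprocess(model_name):
    size = RESOLUTIONS[model_name]
    return transforms.Compose([
        # Crop the largest centered square from the input PIL image.
        transforms.Lambda(lambda im: F.center_crop(im, min(im.size))),
        # Resize the square crop to the model's input resolution.
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])

# Example usage: preprocess = build_preprocess("stable_diffusion")
```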