Text-to-Image Diffusion Models are Zero-Shot Classifiers

Authors: Kevin Clark, Priyank Jaini

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment: for each variable, the result and the supporting LLM response.

Research Type: Experimental
LLM response: "We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. ... In this paper, we investigate these questions by transferring Imagen and Stable Diffusion (SD) to discriminative tasks."

Researcher Affiliation: Industry
LLM response: "Kevin Clark, Google DeepMind, Toronto, kevclark@google.com; Priyank Jaini, Google DeepMind, Toronto, pjaini@google.com"

Pseudocode: Yes
LLM response: "Algorithm 1: Diffusion model classification with pruning."

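For orientation, here is a minimal Python sketch of what a procedure like the paper's Algorithm 1 could look like: each candidate class prompt is scored by the diffusion model's denoising error, and once at least min_scores scores have accumulated, classes that score significantly worse than the current best are pruned. The default hyperparameters match those quoted in the Experiment Setup entry below. The `denoising_error` callable is a hypothetical stand-in for one forward pass of the diffusion model, and the one-sided paired t-test pruning rule is an assumption about how cutoff_pval is applied; the paper's exact scoring and statistical test may differ.

```python
# Sketch of diffusion-model zero-shot classification with pruning,
# in the spirit of the paper's Algorithm 1. Not the authors' code.
import numpy as np
from scipy import stats

def classify(image, class_prompts, denoising_error,
             min_scores=20, max_scores=2000, cutoff_pval=2e-3):
    candidates = list(range(len(class_prompts)))
    scores = {c: [] for c in candidates}  # per-class denoising errors

    for step in range(max_scores):
        # Share the same (t, eps) across all remaining classes so that
        # per-step score differences are paired and lower-variance.
        t = np.random.uniform(0.0, 1.0)
        eps = np.random.randn(*image.shape)
        for c in candidates:
            # Hypothetical hook: ||eps_hat(x_t, prompt, t) - eps||^2.
            scores[c].append(denoising_error(image, class_prompts[c], t, eps))

        if step + 1 >= min_scores and len(candidates) > 1:
            best = min(candidates, key=lambda c: np.mean(scores[c]))
            kept = [best]
            for c in candidates:
                if c == best:
                    continue
                # Prune class c if its errors are significantly larger than
                # the current best's under a one-sided paired t-test.
                _, pval = stats.ttest_rel(scores[c], scores[best],
                                          alternative='greater')
                if pval >= cutoff_pval:
                    kept.append(c)
            candidates = kept
        if len(candidates) == 1:
            break

    # Predict the remaining class with the lowest mean denoising error.
    return min(candidates, key=lambda c: np.mean(scores[c]))
```

Sharing the noise sample and timestep across classes is a natural variance-reduction choice for paired comparisons, but it is an assumption here rather than something stated in this summary.
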
Open Source Code: No
LLM response: "The paper does not provide an explicit statement about releasing the source code for the described methodology, nor does it include a link to a code repository."

Open Datasets: Yes
LLM response: "We consider 13 datasets from Radford et al. (2021) as reported in Table 1. ... We compare Imagen and SD with CLIP, the recently proposed ViT-22B model (Dehghani et al., 2023), which was trained on JFT (Sun et al., 2017) extended to 4B images (Zhai et al., 2022) and fine-tuned on Imagenet, and a (not zero-shot) supervised ResNet50 model trained on the training set. ... We use synthetic images similar to Lewis et al. (2022), where images are generated based on the CLEVR (Johnson et al., 2017) visual question answering dataset."

Dataset Splits: No
LLM response: "The paper mentions using 'reduced-size datasets (4096 examples)' for experiments and discusses 'train' and 'test' scenarios implicitly. However, it does not specify explicit validation splits (e.g., percentages or sample counts for validation sets) or describe a cross-validation setup for reproducibility."

Hardware Specification: No
LLM response: "The paper does not specify the hardware (e.g., GPU models, CPU types, or cloud computing instances with specifications) used to run the experiments."

Software Dependencies: No
LLM response: "The paper mentions using specific models like 'Stable Diffusion v1.4' and 'frozen T5... language encoder' or 'pre-trained text encoder from CLIP', but it does not list specific software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, CUDA 11.x) for replication."

Experiment Setup: Yes
LLM response: "We preprocess each dataset by performing a central crop and then resizing the images to 64x64 resolution for Imagen, 512x512 for SD, and 224x224 for CLIP. We use min_scores = 20, max_scores = 2000, and cutoff_pval = 2e-3. We use a single prompt for each image."

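For concreteness, a minimal sketch of the stated preprocessing using torchvision. Cropping to the shorter image side and the default interpolation are assumptions; the source only states a central crop followed by the model-specific resize.

```python
# Sketch of the described preprocessing: a square central crop followed by
# a model-specific resize (64x64 for Imagen, 512x512 for Stable Diffusion,
# 224x224 for CLIP). Crop size and interpolation mode are assumptions.
from torchvision import transforms
from torchvision.transforms import functional as F

RESOLUTIONS = {"imagen": 64, "stable_diffusion": 512, "clip": 224}

def build_preprocess(model_name):
    size = RESOLUTIONS[model_name]
    return transforms.Compose([
        # Crop the largest centered square from the input PIL image.
        transforms.Lambda(lambda im: F.center_crop(im, min(im.size))),
        # Resize the square crop to the model's input resolution.
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])

# Example usage: preprocess = build_preprocess("stable_diffusion")
```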