CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets

Authors: Zachary Novack, Julian McAuley, Zachary Chase Lipton, Saurabh Garg

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CHiLS on a wide array of image classification benchmarks with and without available hierarchical information. These datasets share the property of having an underlying semantic substructure that is not captured in the initial set of class label names.
Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, University of California San Diego; 2 Machine Learning Department, Carnegie Mellon University. Correspondence to: Zachary Novack <znovack@ucsd.edu>, Saurabh Garg <sgarg2@andrew.cmu.edu>.
Pseudocode | Yes | Algorithm 1: Classification with Hierarchical Label Sets (CHiLS). A hedged sketch of this decision rule appears after the table.
Open Source Code | Yes | Code is available at: https://github.com/acmi-lab/CHILS
Open Datasets | Yes | As we are primarily concerned with improving zero-shot CLIP performance in situations with uninformative and/or semantically coarse class labels as described in Section 4, we test our method on the following 15 image benchmarks: the four BREEDS ImageNet subsets (Living17, Nonliving26, Entity13, and Entity30) (Santurkar et al., 2021), CIFAR20 (the coarse-label version of CIFAR100; Krizhevsky (2009)), Food-101 (Bossard et al., 2014), Fruits360 (Mureșan & Oltean, 2018), Fashion1M (Xiao et al., 2015), Fashion-MNIST (Xiao et al., 2017), LSUN-Scene (Yu et al., 2015), Office31 (Saenko et al., 2010), OfficeHome (Venkateswara et al., 2017), ObjectNet (Barbu et al., 2019), EuroSAT (Helber et al., 2019; 2018), and RESISC45 (Cheng et al., 2017).
Dataset Splits | Yes | We use the validation sets for each dataset (if present).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions 'DaVinci-002' for GPT-3 and the 'ViT-L/14@336px' backbone for CLIP, but does not list specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks with their versions).
Experiment Setup | Yes | Unless otherwise specified, we use the ViT-L/14@336px backbone (Radford et al., 2021) for our CLIP model, and used DaVinci-002 (with temperature fixed at 0.7) for all ablations involving GPT-3. For the choice of the prompt embedding function T(·), for each dataset we experiment (where applicable) with two different functions: (1) using the average text embeddings of the 75 different prompts for each label used for ImageNet in Radford et al. (2021), where the prompts cover a wide array of captions, and (2) following the procedure that Radford et al. (2021) puts forth for more specialized datasets, we modify the standard prompt to be of the form "A photo of a {}, a type of [context].", where [context] is dataset-dependent (e.g., food in the case of Food-101). In the case that a custom prompt set exists for a dataset, as is the case with multiple datasets that the present work shares with Radford et al. (2021), we use the given prompt set for the latter option rather than building it from scratch. For each dataset, we use the prompt set that gives us the best baseline (i.e., superclass) zero-shot performance. More details are in Appendix C. Hedged sketches of the GPT-3 label-set query and the prompt-ensemble embedding step follow the table.
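
As referenced in the Pseudocode row, the paper's Algorithm 1 (CHiLS) scores an image against both the original superclass labels and an expanded set of subclass labels, then reconciles the two predictions. Below is a minimal sketch of that decision rule, assuming precomputed CLIP image-text similarity scores; the function and variable names (chils_predict, superclass_scores, subclass_to_super) are illustrative and not taken from the authors' released code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def chils_predict(superclass_scores, subclass_scores, subclass_to_super):
    """Sketch of the CHiLS decision rule for a single image.

    superclass_scores : (num_superclasses,) raw CLIP image-text similarities
                        against the superclass prompts.
    subclass_scores   : (num_subclasses,) raw similarities against the union
                        of all subclass prompts.
    subclass_to_super : sequence mapping each subclass index to its superclass index.
    """
    p_super = softmax(superclass_scores)
    p_sub = softmax(subclass_scores)

    # Prediction made directly in superclass space.
    super_pred = int(np.argmax(p_super))
    # Prediction made in subclass space, then mapped back to its superclass.
    sub_pred = int(np.argmax(p_sub))
    mapped_pred = subclass_to_super[sub_pred]

    if mapped_pred == super_pred:
        return super_pred
    # Otherwise keep whichever of the two predictions is more confident.
    if p_sub[sub_pred] >= p_super[super_pred]:
        return mapped_pred
    return super_pred
```

In practice the raw similarity scores would come from the dot product of the normalized CLIP image embedding with the label embeddings produced by T(·).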
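Where no hierarchy is available, the Experiment Setup row notes that subclass label sets are generated with GPT-3 (DaVinci-002, temperature fixed at 0.7). A rough sketch of such a query, using the legacy openai-python (<1.0) Completion API, is shown below; the prompt wording and the generate_label_set helper are assumptions for illustration, not the authors' exact prompt or code.

```python
# Assumes the legacy openai-python (<1.0) package and that the API key is
# available via the OPENAI_API_KEY environment variable.
import openai

def generate_label_set(superclass: str, m: int = 10) -> list:
    """Query GPT-3 for a candidate subclass label set (illustrative prompt)."""
    prompt = (
        f"Generate a list of {m} types of the following: {superclass}. "
        "List one per line."
    )
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        temperature=0.7,
        max_tokens=256,
    )
    text = response["choices"][0]["text"]
    # Keep non-empty lines, stripping numbering/bullet characters.
    subclasses = [line.strip(" -0123456789.").strip()
                  for line in text.splitlines() if line.strip()]
    return subclasses[:m]
```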
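The prompt embedding function T(·) described in the Experiment Setup row either averages the 75 ImageNet prompt templates per label or uses a dataset-specific "A photo of a {}, a type of [context]." prompt. Below is a minimal sketch of this averaging step, assuming OpenAI's clip package and the ViT-L/14@336px backbone; the two templates listed are placeholders rather than the full prompt sets used in the paper.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

# Placeholder templates: in the paper these would be either the 75 ImageNet
# prompts or a dataset-specific "a type of [context]" prompt.
templates = [
    "a photo of a {}.",
    "a photo of a {}, a type of food.",
]

def embed_labels(label_names):
    """Return one averaged, L2-normalized text embedding per label name."""
    embeddings = []
    with torch.no_grad():
        for name in label_names:
            tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
            feats = model.encode_text(tokens)
            feats = feats / feats.norm(dim=-1, keepdim=True)
            mean_feat = feats.mean(dim=0)
            embeddings.append(mean_feat / mean_feat.norm())
    return torch.stack(embeddings)
```

Zero-shot classification then reduces to the cosine similarity between the normalized image embedding and each row of the returned matrix, which supplies the raw scores consumed by the CHiLS decision rule sketched above.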