CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets
Authors: Zachary Novack, Julian McAuley, Zachary Chase Lipton, Saurabh Garg
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CHiLS on a wide array of image classification benchmarks with and without available hierarchical information. These datasets share the property of having an underlying semantic substructure that is not captured in the initial set of class label names. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, University of California San Diego 2Machine Learning Department, Carnegie Mellon University. Correspondence to: Zachary Novack <znovack@ucsd.edu>, Saurabh Garg <sgarg2@andrew.cmu.edu>. |
| Pseudocode | Yes | Algorithm 1 Classification with Hierarchical Label Sets (CHiLS) (a hedged sketch of this inference procedure appears below the table) |
| Open Source Code | Yes | Code is available at: https://github.com/acmi-lab/CHILS. |
| Open Datasets | Yes | As we are primarily concerned with improving zero-shot CLIP performance in situations with uninformative and/or semantically coarse class labels as described in Section 4, we test our method on the following image benchmarks: the four BREEDS ImageNet subsets (Living17, Nonliving26, Entity13, and Entity30) (Santurkar et al., 2021), CIFAR20 (the coarse-label version of CIFAR100; Krizhevsky (2009)), Food-101 (Bossard et al., 2014), Fruits360 (Mureşan & Oltean, 2018), Fashion1M (Xiao et al., 2015), Fashion-MNIST (Xiao et al., 2017), LSUN-Scene (Yu et al., 2015), Office31 (Saenko et al., 2010), OfficeHome (Venkateswara et al., 2017), ObjectNet (Barbu et al., 2019), EuroSAT (Helber et al., 2019; 2018), and RESISC45 (Cheng et al., 2017). |
| Dataset Splits | Yes | We use the validation sets for each dataset (if present). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'DaVinci-002' for GPT-3 and the 'ViT-L/14@336px' backbone for CLIP, but does not list specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks with their versions). |
| Experiment Setup | Yes | Unless otherwise specified, we use the ViT-L/14@336px backbone (Radford et al., 2021) for our CLIP model, and used DaVinci-002 (with temperature fixed at 0.7) for all ablations involving GPT-3. For the choice of the prompt embedding function T(·), for each dataset we experiment (where applicable) with two different functions: (1) using the average text embeddings of the 75 different prompts for each label used for ImageNet in Radford et al. (2021), where the prompts cover a wide array of captions, and (2) following the procedure that Radford et al. (2021) puts forth for more specialized datasets, we modify the standard prompt to be of the form 'A photo of a {}, a type of [context].', where [context] is dataset-dependent (e.g., 'food' in the case of Food-101). In the case that a custom prompt set exists for a dataset, as is the case with multiple datasets that the present work shares with Radford et al. (2021), we use the given prompt set for the latter option rather than building it from scratch. For each dataset, we use the prompt set that gives us the best baseline (i.e., superclass) zero-shot performance. More details are in Appendix C. (A hedged sketch of the prompt-ensembling step in option (1) appears below the table.) |
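
The pseudocode row above points to the paper's Algorithm 1. Below is a minimal sketch of that inference procedure as we read it from the paper (score the superclasses, score the union of the subclass label sets, reweight each subclass probability by its parent superclass's probability, and predict the parent of the best-scoring subclass). It assumes the OpenAI `clip` package with the ViT-L/14@336px backbone; the `label_sets` mapping and all helper names are our own, and prompt ensembling is omitted for brevity.

```python
# Hedged sketch of CHiLS-style inference, not the authors' released code.
# `label_sets` is a hypothetical {superclass: [subclass, ...]} dict, built either
# from a ground-truth hierarchy or from GPT-3-generated subclass names.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

def text_embed(names, template="a photo of a {}."):
    """Embed class names with a single prompt template (no prompt ensembling)."""
    tokens = clip.tokenize([template.format(n) for n in names]).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    return emb / emb.norm(dim=-1, keepdim=True)

def chils_predict(image, superclasses, label_sets):
    """Return the predicted superclass for one preprocessed image tensor."""
    with torch.no_grad():
        img = model.encode_image(image.unsqueeze(0).to(device))
    img = img / img.norm(dim=-1, keepdim=True)

    # Standard zero-shot probabilities over the superclass names.
    p_super = (100.0 * img @ text_embed(superclasses).T).softmax(dim=-1).squeeze(0)

    # Zero-shot probabilities over the union of all subclass label sets.
    pairs = [(c, s) for c in superclasses for s in label_sets[c]]
    p_sub = (100.0 * img @ text_embed([s for _, s in pairs]).T).softmax(dim=-1).squeeze(0)

    # Reweight each subclass by its parent's superclass probability and
    # predict the parent of the highest-scoring subclass.
    parent_idx = torch.tensor([superclasses.index(c) for c, _ in pairs], device=device)
    scores = p_sub * p_super[parent_idx]
    return superclasses[parent_idx[scores.argmax()].item()]
```

For example, with superclasses `["dog", "cat"]` and label sets `{"dog": ["beagle", "poodle"], "cat": ["tabby", "siamese"]}`, the prediction is always one of the two superclasses; the reweighting step is intended to keep noisy or off-topic generated subclasses from overriding a confident superclass score.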
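
The experiment-setup row also describes the prompt embedding function T(·) as an average over the ImageNet prompt set of Radford et al. (2021). The snippet below is a minimal sketch of that ensembling step under the same assumptions as above (OpenAI `clip` package); only a few illustrative templates are listed, not the full prompt set, and the function name is our own.

```python
# Hedged sketch of prompt ensembling for T(·): average the normalized text
# embeddings of every prompt template applied to a single class name.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14@336px", device=device)

PROMPT_TEMPLATES = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a photo of many {}.",
    # ... remaining ImageNet templates from Radford et al. (2021)
]

def ensemble_text_embedding(class_name: str) -> torch.Tensor:
    """Return the re-normalized mean prompt embedding for one class name."""
    tokens = clip.tokenize([t.format(class_name) for t in PROMPT_TEMPLATES]).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each prompt embedding
    mean = emb.mean(dim=0)                      # average over prompts
    return mean / mean.norm()                   # re-normalize the ensemble
```

For the specialized-dataset option in the setup row, the same function would instead format the single template 'A photo of a {}, a type of [context].' with the dataset-specific context.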