Hierarchical Prompt Learning for Compositional Zero-Shot Recognition

Authors: Henan Wang, Muli Yang, Kun Wei, Cheng Deng

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method is evaluated on three Compositional Zero-Shot Learning (CZSL) benchmark datasets, i.e., MIT-States [Isola et al., 2015], UT-Zappos [Yu and Grauman, 2014] and C-GQA [Naeem et al., 2021]. We follow the standard split in previous works [Purushwalkam et al., 2019], and the detailed information of each dataset is summarized in Tab. 1. MIT-States is a challenging dataset containing 53,753 everyday images, e.g., young cat and old dog. It is annotated with 115 state classes and 245 object classes, giving 1,962 compositions in total under the closed-world setting, of which 1,262 are seen in training and 700 are unseen. UT-Zappos is a fine-grained dataset containing 50,025 images, primarily of various types of shoes, e.g., canvas slippers and rubber sandals, with 12 object classes and 16 state classes, yielding 116 plausible compositions, 83 of which are seen and the remaining 33 unseen. C-GQA is a compositional version of the Stanford GQA dataset [Hudson and Manning, 2019]; it contains 39,298 images in total, with 413 state classes and 674 object classes, including 5,592 seen compositions and 1,932 unseen compositions. Evaluation Metrics. We evaluate the performance according to prediction accuracy for recognizing seen and unseen compositions. Following the setting in [Purushwalkam et al., 2019], we compute the accuracy in two situations: 1) Seen, testing only on seen compositions; 2) Unseen, testing only on unseen compositions. Based on these, we compute 3) the Harmonic Mean (HM) of the two metrics, which balances the performance between seen and unseen accuracy. Finally, we compute 4) the Area Under the Curve (AUC) to quantify the overall performance of both seen and unseen accuracy at different operating points with respect to the bias. Following [Chao et al., 2016], we utilize a calibration bias to trade … (A minimal sketch of these metrics is given after the table.)
Researcher Affiliation | Academia | Henan Wang, Muli Yang, Kun Wei and Cheng Deng, School of Electronic Engineering, Xidian University, Xi'an, China, {nnhhwang, muliyang.xd, weikunsk, chdeng.xd}@gmail.com
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | No | No explicit statement of code release or repository link for the described methodology was found.
Open Datasets | Yes | Our method is evaluated on three Compositional Zero-Shot Learning (CZSL) benchmark datasets, i.e., MIT-States [Isola et al., 2015], UT-Zappos [Yu and Grauman, 2014] and C-GQA [Naeem et al., 2021].
Dataset Splits | Yes | Table 1: Dataset details with respect to state/object numbers, pair/image numbers in seen/unseen (S/U) splits, and in val/test sets.
Dataset    | States | Objects | Train Pairs | Train Images | Val Pairs (S/U) | Val Images | Test Pairs (S/U) | Test Images
MIT-States | 115    | 245     | 1,262       | 30,338       | 300/300         | 10,420     | 400/400          | 12,995
UT-Zappos  | 16     | 12      | 83          | 22,998       | 15/15           | 3,214      | 18/18            | 2,914
C-GQA      | 413    | 674     | 5,592       | 26,920       | 1,252/1,040     | 7,280      | 888/923          | 5,098
Hardware Specification | Yes | All of our experiments were conducted on an NVIDIA RTX A6000 GPU.
Software Dependencies | Yes | Our model is implemented with PyTorch [Paszke et al., 2019] and optimized with the Adam [Kingma and Ba, 2014] optimizer, with the learning rate set to 5e-05, 5e-04, and 5e-05 for MIT-States, UT-Zappos, and C-GQA, respectively. The weight decay is 1e-05, 1e-05, and 5e-05, respectively, for the datasets mentioned above. The batch size is set to 128 for all three datasets.
Experiment Setup | Yes | Our model is implemented with PyTorch [Paszke et al., 2019] and optimized with the Adam [Kingma and Ba, 2014] optimizer, with the learning rate set to 5e-05, 5e-04, and 5e-05 for MIT-States, UT-Zappos, and C-GQA, respectively. The weight decay is 1e-05, 1e-05, and 5e-05, respectively, for the datasets mentioned above. The batch size is set to 128 for all three datasets. (See the optimizer configuration sketch after the table.)
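The evaluation protocol quoted in the Research Type row (seen accuracy, unseen accuracy, their harmonic mean, and AUC over a calibration bias added to unseen-pair scores) can be summarized with a minimal sketch. This is not the authors' released evaluation code; the array shapes, variable names, and the linear bias sweep are assumptions made for illustration.

```python
import numpy as np

def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean (HM) of seen and unseen accuracy."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

def auc_over_bias(scores, labels, seen_mask, biases):
    """Area under the seen/unseen accuracy curve swept by a calibration bias.

    scores:    (N, C) compatibility scores of N test images over C compositions
    labels:    (N,) index of the ground-truth composition per image
    seen_mask: (C,) boolean, True for compositions seen during training
    biases:    1-D array of calibration values added to unseen-composition scores
    """
    is_seen_img = seen_mask[labels]               # images whose ground truth is a seen pair
    seen_pts, unseen_pts = [], []
    for b in biases:
        shifted = scores + b * (~seen_mask)       # larger bias favours unseen pairs
        correct = shifted.argmax(axis=1) == labels
        seen_pts.append(correct[is_seen_img].mean())
        unseen_pts.append(correct[~is_seen_img].mean())
    seen_pts, unseen_pts = np.asarray(seen_pts), np.asarray(unseen_pts)
    order = np.argsort(seen_pts)                  # integrate unseen accuracy along the seen axis
    return np.trapz(unseen_pts[order], seen_pts[order])
```

A typical call would sweep something like `biases = np.linspace(-1.0, 1.0, 50)` over validation scores; the exact bias range is a choice of the evaluator and is not specified in the quoted text.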
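The training configuration quoted in the Software Dependencies and Experiment Setup rows maps directly onto a standard PyTorch Adam setup. The sketch below only restates the reported values; the `build_optimizer` function, its `model` argument, and the dataset keys are placeholders (assumptions), not identifiers from the paper.

```python
import torch

# Per-dataset hyperparameters quoted above (learning rate / weight decay);
# keys and names here are illustrative, not taken from the paper's code.
HPARAMS = {
    "mit-states": {"lr": 5e-05, "weight_decay": 1e-05},
    "ut-zappos":  {"lr": 5e-04, "weight_decay": 1e-05},
    "c-gqa":      {"lr": 5e-05, "weight_decay": 5e-05},
}
BATCH_SIZE = 128  # identical for all three datasets

def build_optimizer(model: torch.nn.Module, dataset: str) -> torch.optim.Adam:
    """Adam optimizer with the dataset-specific settings reported in the paper."""
    cfg = HPARAMS[dataset]
    return torch.optim.Adam(model.parameters(),
                            lr=cfg["lr"], weight_decay=cfg["weight_decay"])
```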