HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
Authors: Shijie Geng, Jianbo Yuan, Yu Tian, Yuxiao Chen, Yongfeng Zhang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the advantages of HiCLIP, we conduct qualitative analysis on its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks. |
| Researcher Affiliation | Collaboration | Shijie Geng (1,2), Jianbo Yuan (2), Yu Tian (2), Yuxiao Chen (1), Yongfeng Zhang (1); 1: Rutgers University, 2: ByteDance Inc. {sg1309, yc984, yongfeng.zhang}@rutgers.edu, {jianbo.yuan, yutian.yt}@bytedance.com |
| Pseudocode | Yes | Algorithm 1 Unsupervised hierarchy induction for input images |
| Open Source Code | Yes | We release our implementation of HiCLIP at https://github.com/jeykigung/HiCLIP. |
| Open Datasets | Yes | To make a fair comparison with the state-of-the-art contrastive vision-language pretraining approaches, we adopt the YFCC15M benchmark proposed in (Cui et al., 2022) which builds on a subset from YFCC100M (Thomee et al., 2016) consisting of 15M image-text pairs. In addition, we construct a 30M version of pretraining data by including Conceptual Caption 3M (CC3M) (Sharma et al., 2018) and 12M (CC12M) (Changpinyo et al., 2021). |
| Dataset Splits | Yes | In Table 2, we compare different CLIP-style methods on downstream vision-language tasks, including image-text retrieval, which emphasizes cross-modal alignment, and two vision-language reasoning tasks (VQA and SNLI-VE), which focus more on collaborative multimodal reasoning. ... SNLI (val+test) |
| Hardware Specification | Yes | For the 15M version pretraining data, we set the batch size to 4096 and run all experiments on 32 A100 GPUs. For 30M version pretraining data, we set the batch size to 8192 and run all experiments on 64 A100 GPUs. |
| Software Dependencies | No | Our implementation is based on the open-source PyTorch implementation. ... We use the AdamW optimizer (Loshchilov & Hutter, 2019) ... train a linear classifier with the L-BFGS optimizer from the scikit-learn machine learning library. The paper mentions software and libraries but does not provide specific version numbers for them. (A hedged linear-probe sketch is given after this table.) |
| Experiment Setup | Yes | To make a fair comparison with CLIP family baselines, we train all models for 32 epochs under the same set of pretraining hyperparameters including learning rate, warmup steps, weight decay, etc. The input image size is set to 224 × 224, and the input text sequence length is truncated or padded to 77. The scaling factors σ_t and σ_v of Hierarchy-aware attention are both set to 256 for Group Transformer and Tree Transformer. Following CLIP and DeCLIP, the learnable temperature parameter τ is initialized as 0.07. ... The learning rate is first linearly increased to 0.001 within 2500 warmup steps, and then decayed to 0 following the cosine strategy. For the 15M version pretraining data, we set the batch size to 4096 ... For 30M version pretraining data, we set the batch size to 8192 ... with a weight decay rate of 0.1 during pretraining. (See the optimizer and learning-rate schedule sketch after this table.) |
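The Experiment Setup row amounts to a short optimization recipe. Below is a minimal PyTorch sketch of that recipe; the stand-in model, the total step count, and the log-scale parameterization of the temperature (the CLIP convention) are assumptions for illustration, not details confirmed by the paper.

```python
import math
import torch

# Reported hyperparameters: peak LR 0.001, 2500 warmup steps, cosine decay to 0,
# weight decay 0.1, learnable temperature initialized to 0.07.
PEAK_LR = 1e-3
WARMUP_STEPS = 2500
TOTAL_STEPS = 100_000   # hypothetical; depends on dataset size, batch size, and the 32 epochs
WEIGHT_DECAY = 0.1

model = torch.nn.Linear(512, 512)  # stand-in for the HiCLIP image/text encoders

# CLIP-style log-scale temperature so that exp(logit_scale) = 1 / 0.07 at init
# (the paper reports tau = 0.07; the exact parameterization is an assumption).
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1.0 / 0.07)))

optimizer = torch.optim.AdamW(
    list(model.parameters()) + [logit_scale],
    lr=PEAK_LR,
    weight_decay=WEIGHT_DECAY,
)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Each training step would call `optimizer.step()` followed by `scheduler.step()`. Note that the two reported hardware configurations imply the same per-GPU batch size: 4096 / 32 A100s = 128, and 8192 / 64 A100s = 128.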
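For the linear-probe evaluation mentioned under Software Dependencies, a plausible reading is CLIP-style linear probing with scikit-learn's L-BFGS solver. The sketch below assumes features have already been extracted with a frozen image encoder; the feature dimension, label count, and regularization strength are illustrative values, not numbers from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder features/labels standing in for frozen-encoder image features.
train_features = np.random.randn(1000, 512).astype(np.float32)
train_labels = np.random.randint(0, 10, size=1000)
test_features = np.random.randn(200, 512).astype(np.float32)

# LogisticRegression with the L-BFGS solver plays the role of the "linear
# classifier trained with L-BFGS from scikit-learn"; C is an assumed value.
probe = LogisticRegression(solver="lbfgs", max_iter=1000, C=3.16)
probe.fit(train_features, train_labels)
test_predictions = probe.predict(test_features)
```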