PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
Authors: Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Rongrong Ji, Chunhua Shen
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same amount of 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP on ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 based image encoders, respectively. |
| Researcher Affiliation | Collaboration | Yuting Gao¹, Jinfeng Liu¹,², Zihan Xu¹, Jun Zhang¹, Ke Li¹, Chunhua Shen³ (¹Tencent Youtu Lab; ²Shanghai Jiaotong University; ³Zhejiang University) |
| Pseudocode | No | The paper describes the model architecture and training process in text and figures, but it does not provide any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | Our codes may be released in the future. |
| Open Datasets | Yes | We experiment on three different architectures, PyramidCLIP-ResNet50/ViT-B32/ViT-B16, according to the choice of image encoder. Their detailed architecture designs follow that of CLIP (6). LAION99M contains 99M image-text pairs with the highest similarity selected from LAION400M (23) according to the similarity scores provided by the producer (a hedged filtering sketch follows the table). ... Table 1 (pre-training datasets): SBU (24), 1M; CC3M (25), 3M; CC12M (26), 10M; YFCC15M-V1 (27), 15M; YFCC15M-V2 (10), 15M; LAION99M (23), 99M. |
| Dataset Splits | No | The paper mentions using specific datasets for pre-training (e.g., YFCC15M-V1) and for downstream evaluations (e.g., ImageNet, MS-COCO). While these are standard benchmarks, the paper does not explicitly state the train/validation/test splits (e.g., specific percentages or sample counts for each split) used for its own experimental runs for either the pre-training or the downstream tasks, other than referring to 'zero-shot' tasks where no training split is used. |
| Hardware Specification | No | The paper states in its ethics checklist that hardware specifications are in Appendix A, but Appendix A is not provided in the submitted document. Therefore, specific hardware details are not available in the provided text. |
| Software Dependencies | No | The paper mentions using a "publicly released T5 model (19)" and "an object-attribute detector pre-trained by VinVL (16)", but does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | For a batch of N image-text pairs ... where τ is a learnable temperature parameter initialized with 0.07 ... where α is the smoothing hyper-parameter set to 0.2 in our experiments. ... where the loss weights λ and µ are both set to 1/3 in our experiments. ... All the experiments are pre-trained for 8 epochs on YFCC15M-V1. ... L = 12 and L_s = 9 in the experiments. (A hedged loss sketch built from these hyper-parameters follows the table.) |
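
The hyper-parameters quoted under Experiment Setup pin the contrastive objective down fairly tightly. Below is a minimal PyTorch sketch of a symmetric InfoNCE loss with a learnable temperature initialized to 0.07 and label smoothing α = 0.2, plus the λ = µ = 1/3 weighting of the combined loss. It is an illustration assembled from the quoted values, not the authors' released code; in particular, which pyramid levels feed each term in `pyramid_loss` is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothedContrastiveLoss(nn.Module):
    """Symmetric InfoNCE over a batch of N image-text pairs, with label
    smoothing (alpha = 0.2) and a learnable temperature tau (init 0.07)."""

    def __init__(self, alpha: float = 0.2, init_tau: float = 0.07):
        super().__init__()
        self.alpha = alpha
        # Learn log(tau) so the temperature stays positive during training.
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_tau)))

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        n = image_emb.size(0)
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / self.log_tau.exp()  # (N, N)

        # Softened targets: (1 - alpha) mass on the matched pair,
        # alpha spread uniformly over the batch.
        targets = torch.full((n, n), self.alpha / n, device=logits.device)
        targets.fill_diagonal_(1.0 - self.alpha + self.alpha / n)

        loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
        return 0.5 * (loss_i2t + loss_t2i)

# The paper sets the loss weights lambda = mu = 1/3; the naming of the
# three alignment terms here is hypothetical, not quoted from the paper.
def pyramid_loss(loss_peer: torch.Tensor,
                 loss_cross: torch.Tensor,
                 loss_global: torch.Tensor,
                 lam: float = 1 / 3, mu: float = 1 / 3) -> torch.Tensor:
    return lam * loss_peer + mu * loss_cross + (1 - lam - mu) * loss_global
```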
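
The LAION99M description ("99M image-text pairs with the highest similarity selected from LAION400M") is likewise simple enough to sketch. The two-pass threshold selection below and the `similarity` column name (taken from the public LAION-400M metadata parquet release) are assumptions; the authors' actual filtering script is not released.

```python
import numpy as np
import pandas as pd

def select_top_pairs(shard_paths, k=99_000_000):
    """Keep the k pairs with the highest provider-supplied similarity.

    Pass 1 finds the similarity cutoff; pass 2 filters each metadata shard.
    Loading all ~400M scores at once assumes a large-memory machine;
    stream and subsample otherwise. Ties at the cutoff may keep slightly
    more than k rows.
    """
    sims = np.concatenate([
        pd.read_parquet(p, columns=["similarity"])["similarity"].to_numpy()
        for p in shard_paths
    ])
    cutoff = np.partition(sims, len(sims) - k)[len(sims) - k]
    kept = [
        df[df["similarity"] >= cutoff]
        for df in (pd.read_parquet(p) for p in shard_paths)
    ]
    return pd.concat(kept, ignore_index=True)
```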