PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Authors: Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Rongrong Ji, Chunhua Shen

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same amount of 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP on ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 based image encoder respectively.
Researcher Affiliation | Collaboration | Yuting Gao (1), Jinfeng Liu (1,2), Zihan Xu (1), Jun Zhang (1), Ke Li (1), Chunhua Shen (3); (1) Tencent Youtu Lab, (2) Shanghai Jiaotong University, (3) Zhejiang University
Pseudocode | No | The paper describes the model architecture and training process in text and figures, but it does not provide any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper states only that "Our codes may be released in the future."
Open Datasets | Yes | We experiment on three different architectures, PyramidCLIP-ResNet50/ViT-B32/ViT-B16, according to the choice of image encoder. Their detailed architecture designs follow that of CLIP (6). LAION99M contains 99M image-text pairs with the highest similarity selected from LAION400M (23) according to the similarity scores provided by the producer. ... Table 1 (pre-training datasets): SBU (24), 1M; CC3M (25), 3M; CC12M (26), 10M; YFCC15M-V1 (27), 15M; YFCC15M-V2 (10), 15M; LAION99M (23), 99M. (A sketch of the similarity-based selection described for LAION99M is given after this table.)
Dataset Splits | No | The paper mentions using specific datasets for pre-training (e.g., YFCC15M-V1) and for downstream evaluations (e.g., ImageNet, MS-COCO). While these are standard benchmarks, the paper does not explicitly state the train/validation/test splits (e.g., specific percentages or sample counts) used for its own experimental runs, for either pre-training or the downstream tasks, beyond the zero-shot settings in which no training split is used.
Hardware Specification | No | The paper states in its ethics checklist that hardware specifications are in Appendix A, but Appendix A is not provided in the submitted document. Therefore, specific hardware details are not available in the provided text.
Software Dependencies | No | The paper mentions using a "publicly released T5 model (19)" and "an object-attribute detector pre-trained by VinVL (16)", but does not provide specific version numbers for these or any other software dependencies needed for reproducibility.
Experiment Setup | Yes | For a batch of N image-text pairs ... where τ is a learnable temperature parameter initialized with 0.07 ... where α is the smoothing hyper-parameter set to 0.2 in our experiments. ... where the loss weights λ and µ are both set to 1/3 in our experiments. ... All the experiments are pre-trained for 8 epochs on YFCC15M-V1. ... L = 12 and Ls = 9 in the experiments.
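
To make the Experiment Setup row concrete, below is a minimal sketch of the kind of training objective those quoted hyper-parameters describe: a CLIP-style symmetric contrastive loss with a learnable temperature initialized to 0.07, label smoothing with α = 0.2, and a weighted sum of alignment terms with λ = µ = 1/3. This is an illustration under stated assumptions, not the authors' code; the names `SoftenedContrastiveLoss` and `total_loss`, the exact smoothing scheme, and the decomposition into exactly three loss terms are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftenedContrastiveLoss(nn.Module):
    """CLIP-style symmetric contrastive loss with label smoothing (illustrative sketch)."""

    def __init__(self, init_temperature: float = 0.07, alpha: float = 0.2):
        super().__init__()
        # Learnable temperature tau, stored as log(1 / tau) as in CLIP.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / init_temperature)))
        self.alpha = alpha  # label-smoothing coefficient

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Normalize embeddings and compute the N x N image-text similarity matrix.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = self.logit_scale.exp() * image_emb @ text_emb.t()

        n = logits.size(0)
        # Softened targets: (1 - alpha) on the matched pair, alpha spread over the rest.
        # (The exact smoothing scheme used by the paper is an assumption here.)
        targets = torch.full_like(logits, self.alpha / max(n - 1, 1))
        targets.fill_diagonal_(1.0 - self.alpha)

        loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
        return 0.5 * (loss_i2t + loss_t2i)


def total_loss(global_term, local_term, cross_term, lam=1.0 / 3, mu=1.0 / 3):
    # Weighted combination with lambda = mu = 1/3; the remaining term takes the
    # residual weight (also 1/3). The split into these three terms is illustrative.
    return (1.0 - lam - mu) * global_term + lam * local_term + mu * cross_term
```

With λ = µ = 1/3 the terms are weighted equally, consistent with the quoted setting, and the temperature is learned jointly with the model as in CLIP.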
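
As referenced in the Open Datasets row, LAION99M is described as the 99M pairs with the highest producer-provided similarity scores in LAION400M. Below is a minimal sketch of that kind of selection over public metadata shards; the column names ("url", "caption", "similarity") and the helper `select_top_pairs` are assumptions, not the authors' pipeline.

```python
import pandas as pd


def select_top_pairs(metadata_files, keep: int = 99_000_000) -> pd.DataFrame:
    """Keep the `keep` image-text pairs with the highest similarity scores."""
    shards = [
        pd.read_parquet(path, columns=["url", "caption", "similarity"])
        for path in metadata_files
    ]
    metadata = pd.concat(shards, ignore_index=True)
    # Rank by the producer-provided image-text similarity and keep the top `keep` rows.
    return metadata.nlargest(keep, "similarity")
```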