Fine-Grained Semantically Aligned Vision-Language Pre-Training
Authors: Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, Siliang Tang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University, 2 Huawei Cloud |
| Pseudocode | No | The paper describes its methods in detail through text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The repository of this work is at https://github.com/YYJMJC/LOUPE. |
| Open Datasets | Yes | We compare LOUPE on the widely used MSCOCO [27] and Flickr30K [33] datasets. We compare LOUPE with CLIP on 11 downstream classification datasets... For object detection, we evaluate their mean Average Precision (mAP) at IoU thresholds of {0.3, 0.5} on COCO [27] (65 classes) and PASCAL VOC [11] (20 classes). For visual grounding, we evaluate their top-1 accuracy at an IoU threshold of 0.5 on RefCOCO [51] and RefCOCO+ [51]. |
| Dataset Splits | Yes | For visual grounding, we evaluate their top-1 accuracy at an IoU threshold of 0.5 on RefCOCO [51] and RefCOCO+ [51]. The experiment details of CLIP variants and LOUPE are provided in Appendix E. (Table 3 shows 'val / test A / test B' columns for RefCOCO, indicating a validation set was used for evaluation/reporting. Appendix E also states 'We follow the official split for each dataset and report the standard metrics.') |
| Hardware Specification | Yes | We pre-train the model for 20 epochs using a batch size of 512 on 128 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions the use of specific models (Swin-L, BERT-Small) and an optimizer (AdamW), but does not provide specific version numbers for general software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries. |
| Experiment Setup | Yes | We pre-train the model for 20 epochs using a batch size of 512 on 128 NVIDIA V100 GPUs. We utilize the AdamW [29] optimizer with a learning rate of 2 × 10⁻⁴ and a weight decay of 0.01. |
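
The reported setup translates directly into an optimizer configuration. Below is a minimal sketch assuming a PyTorch implementation (the paper does not state its framework); only the hyperparameters (AdamW, learning rate 2 × 10⁻⁴, weight decay 0.01, 20 epochs, global batch size 512 on 128 V100 GPUs) come from the paper, and the model module is a stand-in.

```python
# Minimal sketch of the reported pre-training optimizer configuration.
# PyTorch is an assumption; the paper does not name the framework.
import torch
import torch.nn as nn

# Stand-in module only; the actual model couples a Swin-L image encoder
# with a BERT-Small text encoder, which is not reproduced here.
model = nn.Linear(768, 768)

# Hyperparameters as reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)

NUM_EPOCHS = 20          # reported pre-training epochs
GLOBAL_BATCH_SIZE = 512  # reported batch size, distributed over 128 NVIDIA V100 GPUs
```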