Fine-Grained Semantically Aligned Vision-Language Pre-Training

Authors: Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, Siliang Tang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks.
Researcher Affiliation | Collaboration | 1 Zhejiang University, 2 Huawei Cloud
Pseudocode | No | The paper describes its methods in detail through text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The repository of this work is at https://github.com/YYJMJC/LOUPE.
Open Datasets | Yes | We compare LOUPE on the widely used MSCOCO [27] and Flickr30K [33] datasets. We compare LOUPE with CLIP on 11 downstream classification datasets... For object detection, we evaluate their mean Average Precision (mAP) at IoU thresholds of {0.3, 0.5} on COCO [27] (65 classes) and PASCAL VOC [11] (20 classes). For visual grounding, we evaluate their top-1 accuracy at an IoU threshold of 0.5 on RefCOCO [51] and RefCOCO+ [51].
Dataset Splits | Yes | For visual grounding, we evaluate their top-1 accuracy at an IoU threshold of 0.5 on RefCOCO [51] and RefCOCO+ [51]. The experiment details of CLIP variants and LOUPE are provided in Appendix E. (Table 3 shows 'val / test A / test B' columns for RefCOCO, indicating a validation set was used for evaluation/reporting. Appendix E also states 'We follow the official split for each dataset and report the standard metrics.')
Hardware Specification | Yes | We pre-train the model for 20 epochs using a batch size of 512 on 128 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions the use of specific models (Swin-L, BERT-Small) and an optimizer (AdamW), but does not provide version numbers for general software dependencies such as the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries.
Experiment Setup | Yes | We pre-train the model for 20 epochs using a batch size of 512 on 128 NVIDIA V100 GPUs. We utilize the AdamW [29] optimizer with a learning rate of 2 × 10^-4 and a weight decay of 0.01.
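The detection and grounding rows above score predictions by Intersection-over-Union against a threshold (0.3 or 0.5). As a minimal illustration of that criterion (not the paper's evaluation code), IoU for axis-aligned boxes in (x1, y1, x2, y2) form can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle; width/height clamp to 0 when boxes are disjoint.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a hit at threshold t when iou(pred, gt) >= t,
# e.g. t = 0.5 for the RefCOCO top-1 accuracy quoted above.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```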
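The Experiment Setup row reports only the learning rate (2 × 10^-4) and weight decay (0.01) for AdamW; a single update step with those values can be sketched in pure Python as follows. The betas and epsilon are the common defaults, not values taken from the paper, and the scalar-parameter form is purely illustrative:

```python
import math

def adamw_step(param, grad, m, v, t, lr=2e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter.

    lr and weight_decay match the paper's reported setup; beta1, beta2,
    and eps are assumed defaults.
    """
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly rather
    # than folded into the gradient (AdamW's difference from Adam + L2).
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * param)
    return param, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1)  # p decreases slightly
```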