GLIPv2: Unifying Localization and Vision-Language Understanding

Authors: Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks.
Researcher Affiliation | Collaboration | University of Washington, Meta AI, Microsoft, UCLA
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is released at https://github.com/microsoft/GLIP.
Open Datasets | Yes | GLIPv2-T... is pre-trained on the following data: 1) O365; 2) GoldG, as in GLIP-T (C); and 3) Cap4M, 4M image-text pairs collected from the web with boxes generated by GLIP-T [36]. GLIPv2-B/GLIPv2-H... training data contain: 1) five ODs (2.78M data); 2) GoldG, as in MDETR [25]; and 3) CC15M+SBU, 16M public image-text data with boxes generated by GLIP-L [36]. Segmentation heads of GLIPv2 models are pre-trained on COCO, LVIS [20], and PhraseCut [54], with all other model parameters frozen. (This data mix is restated in the first sketch after the table.)
Dataset Splits | Yes | For COCO-Det test-dev, * indicates multi-scale evaluation. For COCO-Mask, we also report both bbox and segm on test-dev. For LVIS, we report the numbers for both bbox and segm on minival to avoid data contamination due to the pre-training. For Flickr30K test, we report the metric under R@1. (These splits and metrics are summarized in the second sketch after the table.)
Hardware Specification | No | The paper does not specify concrete hardware details such as the exact GPU or CPU models used for the experiments; the acknowledgements only vaguely mention 'providing computer resources for large-scale training'.
Software Dependencies | No | The paper mentions software components and architectures like Swin Transformer, BERT-Base, and Dynamic Head, but it does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | No | The paper states, 'Due to limited space, we refer to supplementary for details of training recipes and hyper-parameters.' Therefore, specific experimental setup details like hyperparameter values are not provided in the main text.
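
The pre-training data mix quoted in the Open Datasets row is dense; the following minimal Python sketch restates it as plain dictionaries for readability. All variable names and the print_corpus helper are illustrative assumptions and do not come from the GLIP repository or the paper.

```python
# Hypothetical restatement of the GLIPv2 pre-training data mix described in the
# paper; the structure and names here are illustrative, not from the GLIP codebase.

GLIPV2_T_PRETRAIN = {
    "O365": "Objects365 detection data",
    "GoldG": "gold grounding data, as in GLIP-T (C)",
    "Cap4M": "4M web image-text pairs with boxes generated by GLIP-T",
}

GLIPV2_B_H_PRETRAIN = {
    "FiveODs": "five object-detection datasets (2.78M data)",
    "GoldG": "gold grounding data, as in MDETR",
    "CC15M+SBU": "16M public image-text pairs with boxes generated by GLIP-L",
}

# Segmentation heads are pre-trained on these sets with all other parameters frozen.
SEGMENTATION_PRETRAIN = ["COCO", "LVIS", "PhraseCut"]


def print_corpus(name, corpus):
    """Print one pre-training corpus in a readable form."""
    print(name)
    for source, description in corpus.items():
        print(f"  - {source}: {description}")


if __name__ == "__main__":
    print_corpus("GLIPv2-T", GLIPV2_T_PRETRAIN)
    print_corpus("GLIPv2-B / GLIPv2-H", GLIPV2_B_H_PRETRAIN)
    print("Segmentation-head pre-training:", ", ".join(SEGMENTATION_PRETRAIN))
```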
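
Similarly, the evaluation splits and metrics listed in the Dataset Splits row can be summarized as a small lookup table. This is a hedged organizational sketch only; the EVAL_PROTOCOL name and its layout are ours, not the paper's.

```python
# Hypothetical summary of the evaluation protocol described in the paper's
# result tables; benchmark keys and field names are illustrative only.

EVAL_PROTOCOL = {
    "COCO-Det": {"split": "test-dev", "metrics": ["bbox"],
                 "note": "* indicates multi-scale evaluation"},
    "COCO-Mask": {"split": "test-dev", "metrics": ["bbox", "segm"]},
    "LVIS": {"split": "minival", "metrics": ["bbox", "segm"],
             "note": "minival used to avoid pre-training data contamination"},
    "Flickr30K": {"split": "test", "metrics": ["R@1"]},
}

for benchmark, protocol in EVAL_PROTOCOL.items():
    metrics = ", ".join(protocol["metrics"])
    note = f" ({protocol['note']})" if "note" in protocol else ""
    print(f"{benchmark}: {protocol['split']} -- {metrics}{note}")
```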