GLIPv2: Unifying Localization and Vision-Language Understanding
Authors: Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. |
| Researcher Affiliation | Collaboration | University of Washington, Meta AI, Microsoft, UCLA |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is released at https://github.com/microsoft/GLIP. |
| Open Datasets | Yes | GLIPv2-T... is pre-trained on the following data: 1) O365, 2) GoldG as in GLIP-T (C), and 3) Cap4M, 4M image-text pairs collected from the web with boxes generated by GLIP-T [36]. GLIPv2-B/GLIPv2-H... training data contain: 1) FiveODs (2.78M data); 2) GoldG as in MDETR [25]; and 3) CC15M+SBU, 16M public image-text data with generated boxes by GLIP-L [36]. Segmentation heads of GLIPv2 models are pre-trained on COCO, LVIS [20] and PhraseCut [54], with all other model parameters frozen. |
| Dataset Splits | Yes | For LVIS, we report the numbers for both bbox and segm on minival to avoid data contamination due to the pre-training. For COCO-Det test-dev, * indicates multi-scale evaluation. For Flickr30K test, we report the metric under R@1 (a minimal R@1 sketch follows the table). For COCO-Mask, we also report both bbox and segm on test-dev. |
| Hardware Specification | No | The paper does not specify concrete hardware details such as exact GPU or CPU models used for experiments. It only vaguely mentions 'providing computer resources for large-scale training' in the acknowledgements. |
| Software Dependencies | No | The paper mentions software components and architectures like Swin Transformer, BERT-Base, and Dynamic Head, but it does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | No | The paper states, 'Due to limited space, we refer to supplementary for details of training recipes and hyper-parameters.' Therefore, specific experimental setup details like hyperparameter values are not provided in the main text. |
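
For context on the Flickr30K R@1 figure cited in the Dataset Splits row: phrase-grounding recall is conventionally scored by checking whether the top-1 predicted box for a phrase overlaps any ground-truth box with IoU ≥ 0.5. The sketch below illustrates that convention only; the function names and the (x1, y1, x2, y2) box format are assumptions for illustration, not code from the GLIP repository.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_boxes, gold_boxes_per_phrase, iou_thresh=0.5):
    """Fraction of phrases whose top-1 predicted box matches a gold box.

    top1_boxes: one predicted box per phrase.
    gold_boxes_per_phrase: list of gold boxes for the same phrases.
    A phrase counts as a hit if its top-1 box reaches IoU >= iou_thresh
    with at least one gold box (the usual R@1 grounding convention).
    """
    hits = 0
    for pred, golds in zip(top1_boxes, gold_boxes_per_phrase):
        if any(box_iou(pred, g) >= iou_thresh for g in golds):
            hits += 1
    return hits / len(top1_boxes)


if __name__ == "__main__":
    # Toy example with two phrases: one hit, one miss -> R@1 = 0.5.
    preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
    golds = [[(12, 12, 48, 52)], [(100, 100, 140, 140)]]
    print(recall_at_1(preds, golds))
```
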