Grounding Multimodal Large Language Models to the World
Authors: Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, Furu Wei
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | KOSMOS-2 is evaluated on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. Experimental results show that KOSMOS-2 achieves not only competitive performance on language and vision-language tasks, but also leading performance on grounding tasks (phrase grounding and referring expression comprehension) and referring tasks (referring expression generation). |
| Researcher Affiliation | Collaboration | University of Chinese Academy of Sciences; Microsoft Research |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found in https://aka.ms/kosmos-2. |
| Open Datasets | Yes | To learn the grounding capability, we first construct a large-scale dataset of Grounded Image-Text pairs (GRIT), based on image-text pairs from a subset of COYO-700M (Byeon et al., 2022) and LAION-2B (Schuhmann et al., 2022). |
| Dataset Splits | Yes | We evaluate the phrase grounding task on Flickr30k Entities (Plummer et al., 2015) val and test splits. The model is tested on three well-established datasets: RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and RefCOCOg (Mao et al., 2015). |
| Hardware Specification | Yes | The model is trained on 256 V100 GPUs for 24 hours. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). |
| Experiment Setup | Yes | The training procedure uses a batch size of 419K tokens, consisting of 185K tokens from text corpora, 215K tokens from original and grounded image-caption pairs, and 19K tokens from interleaved data. The model is trained for 60K steps, covering approximately 25 billion tokens, using an AdamW optimizer with β = (0.9, 0.98), a weight decay of 0.01, and a dropout rate of 0.1. The learning rate increases to 2e-4 during the first 375 warm-up steps and then linearly decays to zero. The image resolution is set to 224×224 and the patch size is 14×14. To discretize the continuous coordinates, the width and height of the image are divided into 32 equally sized bins, with each bin covering an area of 7×7 pixels. A total of 32×32 location tokens are added to the vocabulary. KOSMOS-2 uses the weights of KOSMOS-1 for initialization; the word embeddings of the newly added 32×32 location tokens are initialized randomly. |
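
The learning-rate schedule reported in the setup row (linear warm-up to 2e-4 over the first 375 steps, then linear decay to zero by step 60K) can be written out as a minimal sketch. This is an illustrative stand-alone function, not the schedule implementation from the released code, which presumably relies on an off-the-shelf scheduler.

```python
# Minimal sketch of the reported learning-rate schedule (assumption:
# per-step linear warm-up followed by linear decay to zero).
PEAK_LR = 2e-4
WARMUP_STEPS = 375
TOTAL_STEPS = 60_000


def learning_rate(step: int) -> float:
    """Return the learning rate at a given training step."""
    if step < WARMUP_STEPS:
        # Linear warm-up from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak down to zero at TOTAL_STEPS.
    remaining = TOTAL_STEPS - step
    return max(0.0, PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS))


print(learning_rate(375))     # 0.0002 (peak, end of warm-up)
print(learning_rate(30_000))  # ~1.0e-4, roughly halfway through decay
```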
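
The coordinate discretization can likewise be made concrete. The paper divides the 224×224 image into a 32×32 grid of 7×7-pixel bins and represents a bounding box by the location tokens of its top-left and bottom-right bins. The sketch below follows that description; the function name and the `<loc_i>` token format are illustrative assumptions, not identifiers from the KOSMOS-2 codebase.

```python
# Sketch of mapping a pixel-space bounding box to two location tokens,
# following the 32x32-bin discretization described in the setup row.
NUM_BINS = 32                       # 32 x 32 location tokens in the vocabulary
IMAGE_SIZE = 224                    # input resolution 224 x 224
BIN_SIZE = IMAGE_SIZE // NUM_BINS   # each bin covers 7 x 7 pixels


def box_to_location_tokens(x0: float, y0: float,
                           x1: float, y1: float) -> tuple[str, str]:
    """Map a box (x0, y0, x1, y1) to top-left and bottom-right location tokens."""
    def bin_index(x: float, y: float) -> int:
        col = min(int(x // BIN_SIZE), NUM_BINS - 1)
        row = min(int(y // BIN_SIZE), NUM_BINS - 1)
        return row * NUM_BINS + col

    top_left = bin_index(x0, y0)
    bottom_right = bin_index(x1, y1)
    # Token naming is hypothetical; only the bin indexing mirrors the paper.
    return f"<loc_{top_left}>", f"<loc_{bottom_right}>"


# Example: a box covering roughly the upper-left quarter of the image.
print(box_to_location_tokens(0, 0, 112, 112))  # ('<loc_0>', '<loc_528>')
```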