Grounding Multimodal Large Language Models to the World

Authors: Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, Furu Wei

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | KOSMOS-2 is evaluated on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding; (ii) multimodal referring, such as referring expression generation; (iii) perception-language tasks; and (iv) language understanding and generation. Experimental results show that KOSMOS-2 achieves not only competitive performance on language and vision-language tasks, but also leading performance on grounding tasks (phrase grounding and referring expression comprehension) and referring tasks (referring expression generation).
Researcher Affiliation | Collaboration | 1 University of Chinese Academy of Sciences; 2 Microsoft Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code can be found at https://aka.ms/kosmos-2.
Open Datasets | Yes | To learn the grounding capability, we first construct a large-scale dataset of Grounded Image-Text pairs (GRIT), based on image-text pairs from a subset of COYO-700M (Byeon et al., 2022) and LAION-2B (Schuhmann et al., 2022).
Dataset Splits | Yes | We evaluate the phrase grounding task on the Flickr30k Entities (Plummer et al., 2015) val and test splits. The model is tested on three well-established datasets: RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and RefCOCOg (Mao et al., 2015).
Hardware Specification | Yes | The model is trained on 256 V100 GPUs for 24 hours.
Software Dependencies | No | The paper does not list ancillary software dependencies with version numbers (e.g., specific library or solver versions).
Experiment Setup | Yes | Training uses a batch size of 419K tokens, consisting of 185K tokens from text corpora, 215K tokens from original and grounded image-caption pairs, and 19K tokens from interleaved data. The model is trained for 60K steps (approximately 25 billion tokens) with the AdamW optimizer, β = (0.9, 0.98), a weight decay of 0.01, and a dropout rate of 0.1. The learning rate increases to 2e-4 during the first 375 warm-up steps and then decays linearly to zero. The image resolution is set to 224×224 and the patch size is 14×14. To discretize the continuous coordinates, the width and height of the image are divided into 32 equally sized bins, each covering an area of 7×7 pixels, and a total of 32×32 location tokens are added to the vocabulary. KOSMOS-2 is initialized from the weights of KOSMOS-1; the newly added 32×32 word embeddings of the location tokens are initialized randomly.
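
To make the coordinate discretization in the Experiment Setup row concrete, here is a minimal Python sketch (not the authors' released code) that maps pixel coordinates to location-token indices, assuming a 224×224 image, a 32×32 grid of 7×7-pixel bins, and an illustrative `<loc_i>` token naming that the paper itself does not specify.

```python
# Minimal sketch of KOSMOS-2-style coordinate discretization.
# Assumptions: 224x224 input image, 32x32 grid of location bins (7x7 pixels each);
# the "<loc_i>" token format is illustrative, not taken from the paper.

IMAGE_SIZE = 224
NUM_BINS = 32
BIN_SIZE = IMAGE_SIZE // NUM_BINS  # 7 pixels per bin


def point_to_bin(x: float, y: float) -> int:
    """Map a pixel coordinate to a single location-token index in [0, 32*32)."""
    col = min(int(x // BIN_SIZE), NUM_BINS - 1)
    row = min(int(y // BIN_SIZE), NUM_BINS - 1)
    return row * NUM_BINS + col


def box_to_location_tokens(x1: float, y1: float, x2: float, y2: float) -> tuple[str, str]:
    """Represent a bounding box by the tokens of its top-left and bottom-right corners."""
    return f"<loc_{point_to_bin(x1, y1)}>", f"<loc_{point_to_bin(x2, y2)}>"


# Example: a box covering roughly the right half of a 224x224 image.
print(box_to_location_tokens(112, 0, 223, 223))  # ('<loc_16>', '<loc_1023>')
```

Representing a box by two corner tokens keeps the grounding interface purely token-based, so bounding boxes can be generated and consumed by the language model in the same way as ordinary text.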