Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Authors: Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We do quantitative evaluations of KOSMOS-G on DreamBench (Ruiz et al., 2022) for single-entity subject-driven generation and MS-COCO (Lin et al., 2014) for text-to-image generation. ... We conduct ablation studies to find out the importance of the image decoder aligning and instruction tuning.
Researcher Affiliation | Collaboration | Xichen Pan (1,2), Li Dong (1), Shaohan Huang (1), Zhiliang Peng (1), Wenhu Chen (3), Furu Wei (1); 1: Microsoft Research, 2: New York University, 3: University of Waterloo
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for the KOSMOS-G methodology.
Open Datasets | Yes | The image-caption pairs are sourced from multiple datasets, including English LAION-2B (Schuhmann et al., 2022), LAION-400M (Schuhmann et al., 2021), COYO-700M (Byeon et al., 2022), and Conceptual Captions (Sharma et al., 2018; Changpinyo et al., 2021). ... We use approximately 9M images from the Open Images V7 dataset (Kuznetsova et al., 2020) to construct our compositional generation instruction tuning data. ... Additionally, we leverage the data constructed by Brooks et al. (2023) for InstructPix2Pix to improve KOSMOS-G's image editing capability.
Dataset Splits | Yes | We do quantitative evaluations of KOSMOS-G on DreamBench (Ruiz et al., 2022) for single-entity subject-driven generation and MS-COCO (Lin et al., 2014) for text-to-image generation. ... The DreamBench dataset contains 30 subjects and features 25 prompt templates, resulting in 750 unique prompts. ... We follow prior work in generating 4 images for each prompt to form the 3,000 images for a comprehensive evaluation. ... For text-to-image generation, we generate images using 30,000 randomly sampled captions from the MS-COCO (2014) validation set. (These counts are checked in the first sketch after the table.)
Hardware Specification | Yes | The whole training process took around four days with 256 NVIDIA V100 GPUs, i.e., one day for image decoder aligning and three days for instruction tuning. (The implied GPU-hour budget is worked out in a sketch after the table.)
Software Dependencies | No | The paper states 'Our implementation is based on the TorchScale (Ma et al., 2022) library, which is designed for large-scale model training. Following KOSMOS-1 (Huang et al., 2023), we also use MAGNETO (Wang et al., 2022), a Transformer variant, as the backbone architecture of our MLLM and AlignerNet. ... We use SentencePiece (Kudo & Richardson, 2018) to tokenize the text.' However, it does not provide specific version numbers for these software components. (A minimal SentencePiece usage sketch follows the table.)
Experiment Setup | Yes | Our implementation is based on the TorchScale (Ma et al., 2022) library. ... Multimodal Language Modeling: we use a batch size of 1.2 million tokens, and the MLLM is trained for 300,000 steps. We adopt the AdamW optimizer with β = (0.9, 0.98), configure the weight decay at 0.01 and the dropout rate at 0.1, and set the learning rate to escalate to 2e-4. ... Image Decoder Aligning: the AlignerNet undergoes training using a batch size of 3,584 sentences for 300,000 steps, with a maximum learning rate of 1e-3. ... Instruction Tuning: the MLLM and AlignerNet are jointly trained with a batch size of 1,024 images, totaling approximately 200 million images over 200,000 steps; the learning rate peaks at 1e-3. (A hedged optimizer sketch for the MLLM stage follows the table.)
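
For the Dataset Splits row, a quick back-of-the-envelope check of the DreamBench evaluation set size; the variable names below are ours, not the paper's.

```python
# Sanity-check the DreamBench evaluation counts quoted in the "Dataset Splits" row.
num_subjects = 30          # subjects in DreamBench
prompt_templates = 25      # prompt templates
images_per_prompt = 4      # images generated per prompt, following prior work

unique_prompts = num_subjects * prompt_templates   # 30 * 25 = 750
total_images = unique_prompts * images_per_prompt  # 750 * 4 = 3000
print(unique_prompts, total_images)                # -> 750 3000
```

The quoted figures are internally consistent: 750 unique prompts at 4 images each yields the 3,000 evaluation images the paper reports.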
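
For the Hardware Specification row, the reported schedule implies the following rough compute budget; this is our arithmetic, not a figure stated in the paper.

```python
# Rough compute budget implied by "around four days with 256 NVIDIA V100 GPUs".
gpus = 256
days_align = 1   # image decoder aligning
days_tune = 3    # instruction tuning

gpu_hours = gpus * (days_align + days_tune) * 24
print(gpu_hours)  # -> 24576 V100 GPU-hours in total
```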
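
For the Software Dependencies row, tokenizing text with SentencePiece looks roughly like the sketch below. The paper does not release a tokenizer model or pin a library version, so `tokenizer.model` is a hypothetical placeholder path.

```python
import sentencepiece as spm

# Minimal SentencePiece usage sketch; "tokenizer.model" is a placeholder,
# since KOSMOS-G does not ship a tokenizer model or specify a version.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "A dog is running on the beach."
pieces = sp.encode(text, out_type=str)  # subword pieces, e.g. ['▁A', '▁dog', ...]
ids = sp.encode(text)                   # the corresponding integer ids
print(pieces, ids)
```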
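
For the Experiment Setup row, a minimal PyTorch sketch of the reported MLLM-stage optimizer settings. The stand-in module and the warmup length are our assumptions: the quote gives the peak learning rate (2e-4) and the AdamW hyperparameters but not the warmup schedule.

```python
import torch

# Stand-in module; the actual MLLM is a MAGNETO-based Transformer, not reproduced here.
model = torch.nn.Linear(1024, 1024)

# Hyperparameters quoted in the "Experiment Setup" row (multimodal language modeling).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,            # peak learning rate
    betas=(0.9, 0.98),  # Adam betas
    weight_decay=0.01,  # weight decay
)

# Linear warmup to the peak rate; warmup_steps is an assumed value, as the paper
# only says the rate "escalates" to 2e-4 during the 300,000-step run.
warmup_steps = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
```

The dropout rate of 0.1 would be set inside the model definition rather than in the optimizer, so it does not appear in this sketch.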