Language Is Not All You Need: Aligning Perception with Language Models

Authors: Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Nils Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions).
Researcher Affiliation | Industry | Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei (Microsoft). https://github.com/microsoft/unilm
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Microsoft: https://github.com/microsoft/unilm
Open Datasets | Yes | The models are trained on web-scale multimodal corpora. The training datasets consist of text corpora, image-caption pairs, and interleaved data of images and texts. Text Corpora: We train our model with The Pile [16] and Common Crawl (CC). ... Image-Caption Pairs: The image-caption pairs are constructed from several datasets, including English LAION-2B [19], LAION-400M [20], COYO-700M [21], and Conceptual Captions [22, 23].
Dataset Splits | No | The paper does not provide explicit training/validation/test splits (as percentages or counts) and does not mention a distinct validation set. It refers to a 'test split' for evaluation, but not to a validation split.
Hardware Specification | No | No specific hardware details, such as GPU or CPU models or memory sizes, are given for the experiments. The paper only states: 'We train KOSMOS-1 with 1.6 billion parameters'.
Software Dependencies | No | The implementation is based on the library TorchScale [13], which is designed for large-scale model training. Compared with the standard Transformer architecture, the model uses MAGNETO [14], a Transformer variant, as the backbone architecture and XPOS [15] relative position encoding for better long-context modeling, and uses SentencePiece for tokenization. While these software components are named, no version numbers are provided. (A hedged configuration sketch follows the table.)
Experiment Setup | Yes | We train KOSMOS-1 with 1.6 billion parameters using a mix of text corpora, image-caption pairs, and interleaved data. We use MAGNETO's initialization for optimization stability and a pretrained CLIP ViT-L/14 model for image representation. The model is trained for 300k steps with a batch size of 1.2 million tokens and the AdamW optimizer. We adopt a learning-rate warm-up and decay schedule and use SentencePiece for tokenization. (An optimizer and scheduler sketch follows the table.)
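
The Software Dependencies row names TorchScale, MAGNETO, and XPOS but no versions. As a concrete starting point, here is a minimal, hypothetical sketch of how such a decoder-only backbone could be configured with TorchScale. This is not the authors' code: all hyperparameter values are placeholders, and the flag names `subln` (MAGNETO-style sub-LayerNorm) and `xpos_rel_pos` (XPOS) are assumptions to verify against the installed TorchScale release.

```python
# Sketch only: a decoder backbone built with TorchScale, roughly matching the
# components named in the paper (MAGNETO backbone, XPOS relative positions).
# Hyperparameters are placeholders; flag names are assumptions to double-check.
from torchscale.architecture.config import DecoderConfig
from torchscale.architecture.decoder import Decoder

config = DecoderConfig(
    vocab_size=64000,              # placeholder SentencePiece vocabulary size
    decoder_embed_dim=2048,        # placeholder hidden width
    decoder_ffn_embed_dim=8192,    # placeholder FFN size
    decoder_layers=24,             # placeholder depth
    decoder_attention_heads=32,    # placeholder head count
    subln=True,                    # MAGNETO sub-LayerNorm (assumed flag name)
    xpos_rel_pos=True,             # XPOS relative position encoding (assumed flag name)
)
decoder = Decoder(config)          # token embeddings can also be passed in explicitly
```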
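
The Experiment Setup row reports 300k training steps, a batch size of 1.2 million tokens, AdamW, and a learning-rate warm-up and decay schedule. The sketch below shows one common way to wire such a schedule in PyTorch; the peak learning rate, warm-up length, decay shape, and AdamW betas are placeholders, not values taken from the paper.

```python
# Sketch of AdamW with linear warm-up followed by linear decay, one possible
# realization of the "warm-up and decay" schedule quoted above.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(2048, 64000)   # stand-in for the 1.6B-parameter model
total_steps = 300_000                  # number of training steps reported in the paper
warmup_steps = 375                     # placeholder warm-up length
peak_lr = 2e-4                         # placeholder peak learning rate

optimizer = AdamW(model.parameters(), lr=peak_lr, betas=(0.9, 0.98), weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then linear decay towards zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(10):                 # schematic loop; the real run takes total_steps updates
    loss = model(torch.randn(8, 2048)).mean()   # dummy batch standing in for ~1.2M tokens
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```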