Language Is Not All You Need: Aligning Perception with Language Models
Authors: Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). |
| Researcher Affiliation | Industry | Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei; Microsoft; https://github.com/microsoft/unilm |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Microsoft https://github.com/microsoft/unilm |
| Open Datasets | Yes | The models are trained on web-scale multimodal corpora. The training datasets consist of text corpora, image-caption pairs, and interleaved data of images and texts. Text Corpora: We train our model with The Pile [16] and Common Crawl (CC). ... Image-Caption Pairs: The image-caption pairs are constructed from several datasets, including English LAION-2B [19], LAION-400M [20], COYO-700M [21], and Conceptual Captions [22, 23]. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test splits or mention a distinct validation set with percentages or counts. It refers to a 'test split' for evaluation but never to a validation split. |
| Hardware Specification | No | No specific hardware details such as GPU or CPU models, or memory specifications, were mentioned for running experiments. The paper states, 'We train KOSMOS-1 with 1.6 billion parameters'. |
| Software Dependencies | No | The implementation is based on the library TorchScale [13], which is designed for large-scale model training. Compared with the standard Transformer architecture, we include the following modifications: We use MAGNETO [14], a Transformer variant, as the backbone architecture and XPOS [15] relative position encoding for better long-context modeling. ... use SentencePiece for tokenization. While these software components are mentioned, specific version numbers for the dependencies are not provided. (A hedged SentencePiece usage sketch follows the table.) |
| Experiment Setup | Yes | We train KOSMOS-1 with 1.6 billion parameters using a mix of text corpora, image-caption pairs, and interleaved data. We use MAGNETO's initialization for optimization stability and a pretrained CLIP ViT-L/14 model for image representation. The model is trained for 300k steps using a batch size of 1.2 million tokens and the AdamW optimizer. We adopt a learning rate warm-up and decay schedule, and use SentencePiece for tokenization. (A hedged sketch of this optimizer and schedule follows the table.) |
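
The Software Dependencies row names SentencePiece for tokenization but gives no version number. The snippet below is a minimal sketch of standard `sentencepiece` library usage for turning text into subword ids; the model file `kosmos1.model` is a hypothetical placeholder, not an artifact released with the paper.

```python
# Minimal SentencePiece usage sketch (standard sentencepiece Python API).
# "kosmos1.model" is a hypothetical tokenizer model file, not released with the paper.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="kosmos1.model")  # hypothetical path

text = "An image of a cat sitting on a couch."
token_ids = sp.encode(text, out_type=int)  # subword ids fed to the language model
pieces = sp.encode(text, out_type=str)     # human-readable subword pieces
print(token_ids)
print(pieces)
```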
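
The Experiment Setup row reports 300k training steps with the AdamW optimizer and a learning-rate warm-up and decay schedule. The sketch below expresses such a schedule with standard PyTorch APIs; the peak learning rate, warm-up length, and exact decay shape are not stated in the quoted text, so those values are placeholders, and this is an illustration rather than the authors' training code.

```python
# Hedged sketch of an AdamW + warm-up/decay schedule using standard PyTorch APIs.
# Only the total step count (300k) is reported above; other values are placeholders.
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model,
                                  peak_lr=2e-4,          # placeholder, not reported
                                  warmup_steps=375,      # placeholder, not reported
                                  total_steps=300_000):  # reported: 300k steps
    """AdamW with linear warm-up to peak_lr, then linear decay to zero."""
    optimizer = AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warm-up
        remaining = total_steps - step
        return max(0.0, remaining / max(1, total_steps - warmup_steps))  # linear decay

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per optimizer step so that the learning rate follows the warm-up-then-decay shape across the 300k steps.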