Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Authors: Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, Maosong Sun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that by solely relying on English multimodal data, VISCPM can achieve the SOTA performance among Chinese open-sourced multimodal models. |
| Researcher Affiliation | Collaboration | Tsinghua University; Beijing University of Posts and Telecommunications; Shanghai Artificial Intelligence Laboratory; Renmin University of China; Zhihu Inc.; ModelBest Inc. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. The training processes are described in narrative text. |
| Open Source Code | Yes | To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM. |
| Open Datasets | Yes | B.1 PRETRAINING DATASET: COCO (Lin et al., 2014): The COCO dataset is a meticulously compiled image caption dataset... Visual Genome (Krishna et al., 2017)... CC3M (Sharma et al., 2018)... CC12M (Changpinyo et al., 2021)... Laion2B (Schuhmann et al., 2022)... Laion-COCO (Christoph et al., 2022)... Wukong (Gu et al., 2022)... Zero (Xie et al., 2022)... B.2 INSTRUCTION TUNING DATASET: LLaVA-Instruct-150K (Liu et al., 2023a)... M3IT (Li et al., 2023b)... UniMM-Chat (Yu et al., 2023a)... (The two-stage data mixture is sketched in code after this table.) |
| Dataset Splits | No | The paper uses various datasets for pretraining and instruction tuning, and specific benchmarks for evaluation (e.g., LLaVA Test Set), but does not provide explicit train/validation/test splits with percentages or counts for any single dataset that would allow reproduction of the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for the experiments. It refers only generically to 'GPUs' in places, without describing the experimental hardware setup. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., CPM-Bee, Stable Diffusion, GPT-4, PyTorch, CUDA) but does not specify their version numbers, which would be needed to reconstruct a reproducible software environment. |
| Experiment Setup | Yes | We train VISCPM-Chat and VISCPM-Chat+ for 180K and 480K steps, respectively, with a batch size of 768 and a learning rate of 1e-5. The instruction tuning lasts for 80K steps with a batch size of 64. (These reported hyperparameters are collected in the second sketch below.) |
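As a reading aid for the Open Datasets row, here is a minimal sketch of the two-stage data mixture described in the paper's Appendices B.1 and B.2. Only the dataset names come from the paper; the `DataSource` structure, the stage tags, and all identifiers are hypothetical, introduced purely for illustration.

```python
# Hypothetical inventory of the VisCPM data mixture; dataset names are from
# the paper (Appendices B.1/B.2), everything else is an assumption.
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    stage: str  # "pretraining" or "instruction_tuning"


PRETRAINING_SOURCES = [
    DataSource("COCO", "pretraining"),
    DataSource("Visual Genome", "pretraining"),
    DataSource("CC3M", "pretraining"),
    DataSource("CC12M", "pretraining"),
    DataSource("Laion2B", "pretraining"),
    DataSource("Laion-COCO", "pretraining"),
    DataSource("Wukong", "pretraining"),
    DataSource("Zero", "pretraining"),
]

INSTRUCTION_TUNING_SOURCES = [
    DataSource("LLaVA-Instruct-150K", "instruction_tuning"),
    DataSource("M3IT", "instruction_tuning"),
    DataSource("UniMM-Chat", "instruction_tuning"),
]

if __name__ == "__main__":
    # Print the mixture grouped by training stage.
    for src in PRETRAINING_SOURCES + INSTRUCTION_TUNING_SOURCES:
        print(f"{src.stage:>18}: {src.name}")
```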
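And for the Experiment Setup row, a minimal sketch collecting the reported schedule. Only the step counts, batch sizes, and the 1e-5 learning rate are from the paper; the `StageConfig` structure and its field names are assumptions, not the authors' actual training code.

```python
# Reported VisCPM training schedule, expressed as a hypothetical config.
# Numbers are from the paper; the StageConfig class is an assumption.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    steps: int
    batch_size: int
    learning_rate: Optional[float]


# Multimodal pretraining, as quoted in the table above.
VISCPM_CHAT_PRETRAIN = StageConfig(steps=180_000, batch_size=768, learning_rate=1e-5)
VISCPM_CHAT_PLUS_PRETRAIN = StageConfig(steps=480_000, batch_size=768, learning_rate=1e-5)

# Instruction tuning: the quoted excerpt gives no learning rate for this
# stage, so it is left as None rather than guessed.
INSTRUCTION_TUNING = StageConfig(steps=80_000, batch_size=64, learning_rate=None)
```

Keeping the unreported instruction-tuning learning rate as `None` makes the gap in the paper's setup explicit, which is exactly the kind of detail the reproducibility assessment above is probing for.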