Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

Authors: Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, Maosong Sun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that by solely relying on English multimodal data, VisCPM can achieve SOTA performance among Chinese open-sourced multimodal models.
Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 Beijing University of Posts and Telecommunications; 3 Shanghai Artificial Intelligence Laboratory; 4 Renmin University of China; 5 Zhihu Inc.; 6 ModelBest Inc.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper; the training processes are described in narrative text.
Open Source Code | Yes | To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.
Open Datasets | Yes | B.1 Pretraining dataset: COCO (Lin et al., 2014): The COCO dataset is a meticulously compiled image caption dataset... Visual Genome (Krishna et al., 2017)... CC3M (Sharma et al., 2018)... CC12M (Changpinyo et al., 2021)... Laion2B (Schuhmann et al., 2022)... Laion-COCO (Christoph et al., 2022)... Wukong (Gu et al., 2022)... Zero (Xie et al., 2022)... B.2 Instruction tuning dataset: LLaVA-Instruct-150K (Liu et al., 2023a)... M3IT (Li et al., 2023b)... UniMM-Chat (Yu et al., 2023a)... (See the dataset summary sketch after this table.)
Dataset Splits | No | The paper uses various datasets for pretraining and instruction tuning, plus specific benchmarks for evaluation (e.g., the LLaVA test set), but provides no explicit train/validation/test splits (percentages or counts) for any single dataset, so the data partitioning cannot be reproduced.
Hardware Specification | No | The paper does not specify the hardware used for the experiments, such as GPU/CPU models, memory, or cloud instance types; it refers only generically to 'GPUs'.
Software Dependencies | No | The paper names various software components and models (e.g., CPM-Bee, Stable Diffusion, GPT-4, PyTorch, CUDA) but does not pin their version numbers, which reproducible dependencies would require. (See the environment-capture sketch after this table.)
Experiment Setup | Yes | We train VisCPM-Chat and VisCPM-Chat+ for 180K and 480K steps, respectively, with a batch size of 768 and a learning rate of 1e-5. The instruction tuning lasts for 80K steps with a batch size of 64. (See the configuration sketch after this table.)
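
For readers tallying the corpora in the Open Datasets row, the named datasets can be summarized in a small bookkeeping structure. This is a hypothetical sketch in Python; the grouping and language tags are assumptions inferred from the dataset descriptions, not code from the VisCPM repository.

```python
# Hypothetical bookkeeping of the datasets named in appendices B.1 and B.2.
# The language grouping is an assumption inferred from the dataset descriptions.
PRETRAINING_DATASETS = {
    "english": ["COCO", "Visual Genome", "CC3M", "CC12M", "Laion2B", "Laion-COCO"],
    "chinese": ["Wukong", "Zero"],
}

INSTRUCTION_TUNING_DATASETS = ["LLaVA-Instruct-150K", "M3IT", "UniMM-Chat"]
```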
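
Because the Hardware Specification and Software Dependencies rows are unresolved, anyone reproducing the experiments would need to record their own environment. A minimal sketch using standard PyTorch and stdlib introspection calls; this is general reproducibility practice, not code from the paper.

```python
import platform

import torch

# Log the interpreter, library, and GPU details alongside experiment results,
# since the paper does not pin any of them. All calls are standard APIs.
print("python:", platform.python_version())
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)  # CUDA toolkit torch was built with (None on CPU builds)
if torch.cuda.is_available():
    print("gpu   :", torch.cuda.get_device_name(0))
    print("count :", torch.cuda.device_count())
```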
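
The Experiment Setup row maps directly onto a hyperparameter record. A minimal sketch under the assumption of these field names; the paper states only the values, not any configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    """Multimodal pretraining hyperparameters quoted from the paper;
    the field names themselves are assumptions."""
    steps: int
    batch_size: int = 768
    learning_rate: float = 1e-5


VISCPM_CHAT = PretrainConfig(steps=180_000)
VISCPM_CHAT_PLUS = PretrainConfig(steps=480_000)

# Instruction tuning: 80K steps with a batch size of 64; the paper does not
# restate the learning rate for this stage.
INSTRUCTION_TUNING = {"steps": 80_000, "batch_size": 64}
```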