Wings: Learning Multimodal LLMs without Text-only Forgetting

Authors: Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that WINGS outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. WINGS, with its compensating learners, addresses text-only forgetting during visual-modality expansion in general MLLMs.
Researcher Affiliation | Collaboration | Yi-Kai Zhang (1,2,3), Shiyin Lu (3), Yang Li (3), Yanqing Ma (3), Qing-Guo Chen (3), Zhao Xu (3), Weihua Luo (3), Kaifu Zhang (3), De-Chuan Zhan (1,2), Han-Jia Ye (1,2). Affiliations: (1) School of Artificial Intelligence, Nanjing University; (2) National Key Laboratory for Novel Software Technology, Nanjing University; (3) Alibaba International Digital Commerce.
Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (Figures 3 and 4) but does not include pseudocode or an algorithm block.
Open Source Code | No | The paper states: 'The code of our proposed method will be released upon acceptance.' This indicates a planned future release rather than code that was available at the time of publication.
Open Datasets | Yes | We evaluate on MMMU [138], MME [37], MMBench [84] (MMB) in English (EN) and Chinese (CN), ScienceQA [87] for test (SciQA), SEED-Bench [64] for the image part (SEED), AI2D [57] for test, and HallusionBench [41] (HallB).
Dataset Splits | No | The paper references the MMMU-VAL benchmark split but does not describe how validation splits are constructed or used for model training and evaluation.
Hardware Specification | Yes | These two types of MLLM require about 1.5 and 6 days of training on 8 A100 GPUs, respectively. The training datasets for WINGS-mini are consistent with those of WINGS-pro. It takes approximately 5 days to run on 4 A100 GPUs. (The implied GPU-hour totals are sketched below the table.)
Software Dependencies | No | The paper implicitly points to Python-based frameworks through its citations, but it does not specify version numbers for any software dependency (e.g., PyTorch or other libraries) needed for reproducibility.
Experiment Setup | Yes | We train for 1 epoch with the AdamW optimizer and a cosine learning-rate schedule. Typically, the learning rates for the first and second stages are set at 1e-3 and 2e-6 (with the projector part at 1e-5), respectively.
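
As a quick sanity check on the quoted hardware budget, the training times above can be converted into total GPU-hours. This is a back-of-envelope calculation only; it assumes the stated days are wall-clock time with all listed GPUs occupied, and the run labels are descriptive guesses, not names from the paper.

```python
# Back-of-envelope GPU-hour totals derived from the quoted training times.
runs = {
    "MLLM type A (8x A100, ~1.5 days)": 8 * 1.5 * 24,  # 288 GPU-hours
    "MLLM type B (8x A100, ~6 days)":   8 * 6.0 * 24,  # 1152 GPU-hours
    "WINGS-mini (4x A100, ~5 days)":    4 * 5.0 * 24,  # 480 GPU-hours
}
for name, gpu_hours in runs.items():
    print(f"{name}: {gpu_hours:.0f} GPU-hours")
```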
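
The Experiment Setup row maps directly onto standard PyTorch parameter groups. The sketch below illustrates the quoted second-stage configuration (2e-6 for the model, 1e-5 for the projector, cosine schedule over one epoch); the module names `backbone` and `projector` and the step count are assumptions, since the authors' code is not released.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model: `backbone` stands in for the LLM (with its learners) and
# `projector` for the vision projector; both names are illustrative assumptions.
model = nn.ModuleDict({
    "backbone": nn.Linear(64, 64),
    "projector": nn.Linear(32, 64),
})

# Stage-2 learning rates quoted in the paper: 2e-6 overall, 1e-5 for the projector.
optimizer = AdamW([
    {"params": model["projector"].parameters(), "lr": 1e-5},
    {"params": model["backbone"].parameters(), "lr": 2e-6},
])

# Single-epoch cosine learning-rate schedule; the total step count is an assumed
# value and would normally be len(dataloader) * num_epochs.
total_steps = 10_000
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
```

In the corresponding stage-1 setup, the trainable parameters would instead use the quoted 1e-3 learning rate.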