DreamLLM: Synergistic Multimodal Comprehension and Creation

Authors: Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments highlight DREAMLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy. Project page: dreamllm.github.io. (Section 4, Experiments:) DREAMLLM is a versatile multimodal generalist that excels at zero-shot or in-context vision-language comprehension and synthesis tasks. In this section, we conduct systematic evaluations for demonstration.
Researcher Affiliation | Collaboration | Runpei Dong (1,2), Chunrui Han (3), Yuang Peng (4), Zekun Qi (1,2), Zheng Ge (3), Jinrong Yang (5), Liang Zhao (3), Jianjian Sun (3), Hongyu Zhou (3), Haoran Wei (3), Xiangwen Kong (3), Xiangyu Zhang (3), Kaisheng Ma (4), Li Yi (4,6,7). Affiliations: 1 Xi'an Jiaotong University; 2 Institute for Interdisciplinary Information Core Technology (IIISCT); 3 MEGVII Technology; 4 Tsinghua University; 5 HUST; 6 Shanghai Artificial Intelligence Laboratory; 7 Shanghai Qi Zhi Institute.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: dreamllm.github.io.
Open Datasets | Yes | The training data are constructed based on the following datasets: a) LAION-400M (Schuhmann et al., 2021), b) LAION-COCO (Schuhmann et al., 2023), c) MMC4 (Zhu et al., 2023b), d) BLIP-LAION (Li et al., 2022)...
Dataset Splits | Yes | The MS-COCO dataset primarily contains high-level image abstractions with shorter captions, whereas LN-COCO provides more comprehensive image descriptions (Yu et al., 2022b). DREAMLLM samples 8 images per text prompt on MS-COCO by CLIP score ranking, following previous works (Ramesh et al., 2022). On LN-COCO, DREAMLLM samples one image per prompt without CLIP ranking since the text is too long and exceeds the CLIP length limit. (A sketch of this CLIP re-ranking step follows after the table.)
Hardware Specification | Yes | 128 NVIDIA A800 GPUs.
Software Dependencies | No | Dependencies are named but without version numbers: "We use LLaMA-1 (Touvron et al., 2023a) trained on ShareGPT (Zheng et al., 2023) as the default LLM (i.e., Vicuna-7B (Chiang et al., 2023)), following Liu et al. (2023c), to endow it with instruction-following capacity. During training, we use FlashAttention (Dao et al., 2022) and PyTorch FSDP (Zhao et al., 2023b) to improve training efficiency." (A sketch of an FSDP plus FlashAttention setup follows after the table.)
Experiment Setup | Yes | Training hyper-parameters: Optimizer: AdamW; Learning Rate: 2e-3; Weight Decay: 0.0; Training Epochs: 1; Warmup Ratio: 0.003; Learning Rate Scheduler: Cosine; Batch Size Per GPU: 8; Maximum Token Length: 2048. (A sketch of this optimizer and schedule follows after the table.)
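
As referenced in the Dataset Splits row, below is a minimal sketch of best-of-N re-ranking by CLIP score. It assumes the Hugging Face transformers CLIP API and an OpenAI CLIP checkpoint; the paper does not specify which CLIP model or library it uses, so the checkpoint id and function name are illustrative.

    # Minimal sketch: pick the best of N sampled images for a prompt by CLIP
    # score. Checkpoint id and function name are assumptions, not from the paper.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    def best_of_n_by_clip(prompt, images):
        """Return the candidate (PIL) image most similar to the prompt."""
        inputs = processor(text=[prompt], images=images,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            out = clip(**inputs)
        scores = out.logits_per_image.squeeze(-1)  # (num_images,) similarities
        return images[int(scores.argmax())]

This mirrors the reported protocol of sampling 8 images per MS-COCO prompt and keeping the top-scoring one; on LN-COCO the step is skipped because the long captions exceed CLIP's 77-token text limit.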
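As referenced in the Software Dependencies row, here is a minimal sketch of how FlashAttention and PyTorch FSDP might be combined when training a Vicuna-style LLM. The checkpoint id, wrap policy, and precision are assumptions, and the flash_attention_2 flag is the current transformers switch (the paper cites the original FlashAttention); the authors do not publish their training launcher.

    # Sketch: shard a causal LM with PyTorch FSDP and enable FlashAttention.
    # Run under torchrun so the process-group environment variables are set.
    import functools
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
    from transformers import AutoModelForCausalLM

    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = AutoModelForCausalLM.from_pretrained(
        "lmsys/vicuna-7b-v1.5",                   # assumed checkpoint id
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # needs flash-attn installed
    )
    model = FSDP(
        model,
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=int(1e8)),
        device_id=torch.cuda.current_device(),
    )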
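Finally, as referenced in the Experiment Setup row, the reported hyper-parameters map onto a standard AdamW plus cosine-warmup schedule. The total step count and the placeholder model below are assumptions for illustration only.

    # Sketch: AdamW at lr 2e-3, weight decay 0.0, cosine decay with a 0.003
    # warmup ratio, for one epoch. total_steps depends on the dataset size.
    import torch
    from transformers import get_cosine_schedule_with_warmup

    model = torch.nn.Linear(8, 8)   # stand-in for the actual DreamLLM model
    total_steps = 10_000            # illustrative; 1 epoch over the real data

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=0.0)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.003 * total_steps),
        num_training_steps=total_steps,
    )

    for step in range(total_steps):
        # ... forward pass on a batch of up to 2048 tokens, loss.backward() ...
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

Note that with 128 GPUs and a per-GPU batch size of 8, the effective global batch is 1024 sequences per step.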