DreamLLM: Synergistic Multimodal Comprehension and Creation
Authors: Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments highlight DREAMLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy. Project page: dreamllm.github.io. (Section 4, Experiments:) DREAMLLM is a versatile multimodal generalist that excels at zero-shot or in-context vision-language comprehension and synthesis tasks. In this section, we conduct systematic evaluations for demonstration. |
| Researcher Affiliation | Collaboration | Runpei Dong (1,2), Chunrui Han (3), Yuang Peng (4), Zekun Qi (1,2), Zheng Ge (3), Jinrong Yang (5), Liang Zhao (3), Jianjian Sun (3), Hongyu Zhou (3), Haoran Wei (3), Xiangwen Kong (3), Xiangyu Zhang (3), Kaisheng Ma (4), Li Yi (4,6,7). Affiliations: 1 Xi'an Jiaotong University; 2 Institute for Interdisciplinary Information Core Technology (IIISCT); 3 MEGVII Technology; 4 Tsinghua University; 5 HUST; 6 Shanghai Artificial Intelligence Laboratory; 7 Shanghai Qi Zhi Institute |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: dreamllm.github.io. |
| Open Datasets | Yes | The training data are constructed based on the following datasets: a) LAION-400M (Schuhmann et al., 2021), b) LAION-COCO (Schuhmann et al., 2023), c) MMC4 (Zhu et al., 2023b), d) BLIP-LAION (Li et al., 2022)... |
| Dataset Splits | Yes | The MS-COCO dataset primarily contains high-level image abstractions with shorter captions, whereas LN-COCO provides more comprehensive image descriptions (Yu et al., 2022b). DREAMLLM samples 8 images per text prompt on MS-COCO by CLIP score ranking, following previous works (Ramesh et al., 2022). On LN-COCO, DREAMLLM samples one image per prompt without CLIP ranking since the text is too long and exceeds the CLIP length limit. *(A sketch of this re-ranking step follows the table.)* |
| Hardware Specification | Yes | GPU Device: 128 NVIDIA A800 GPUs |
| Software Dependencies | No | We use LLaMA-1 (Touvron et al., 2023a) trained on ShareGPT (Zheng et al., 2023) as the default LLM (i.e., Vicuna-7B (Chiang et al., 2023)) following Liu et al. (2023c) to endow its instruction-following capacity. During training, we use Flash Attention (Dao et al., 2022) and PyTorch FSDP (Zhao et al., 2023b) to accelerate training efficiency. *(A minimal FSDP sketch follows the table.)* |
| Experiment Setup | Yes | Training hyper-parameters: optimizer AdamW; learning rate 2e-3; weight decay 0.0; training epochs 1; warmup ratio 0.003; learning-rate scheduler cosine; batch size per GPU 8; maximum token length 2048. *(A configuration sketch follows the table.)* |
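
The Dataset Splits row quotes a best-of-8 sampling step ranked by CLIP score on MS-COCO. Below is a minimal sketch of that re-ranking step, assuming the Hugging Face CLIP implementation; the checkpoint name and the `pick_best_by_clip` helper are illustrative, not taken from the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper does not say which CLIP model ranks the samples.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def pick_best_by_clip(prompt, images):
    """Score each candidate image against the prompt and keep the best one.

    DREAMLLM samples 8 candidates per MS-COCO prompt and ranks them this way;
    on LN-COCO this step is skipped because captions exceed CLIP's 77-token limit.
    """
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per image
    return images[scores.argmax().item()]
```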
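
The Software Dependencies row names PyTorch FSDP and Flash Attention as the training accelerators. The sketch below shows the FSDP wrapping pattern under those assumptions; the stand-in module and process-group setup are placeholders rather than DREAMLLM's actual launch code (run under `torchrun` with one process per GPU).

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Placeholder distributed setup; the reported run used 128 NVIDIA A800 GPUs.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# A tiny stand-in module; in the paper this would be the DREAMLLM model itself.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# which is what makes 7B-scale training feasible at this GPU count.
model = FSDP(model)

x = torch.randn(8, 16, 512, device="cuda")  # (batch, seq, dim) dummy input
model(x).sum().backward()

# Flash Attention itself is a separate ingredient: either the flash-attn package
# (Dao et al., 2022) applied at model-construction time, or PyTorch's fused
# torch.nn.functional.scaled_dot_product_attention kernels.
```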
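
The Experiment Setup row lists the training hyper-parameters as a flat table. They map directly onto a trainer configuration; here is a minimal sketch assuming Hugging Face `TrainingArguments`, which the paper does not name — the output path and the `bf16` flag are assumptions.

```python
from transformers import TrainingArguments

# Values copied from the paper's reported hyper-parameters; the rest are defaults.
args = TrainingArguments(
    output_dir="./dreamllm-ckpt",   # illustrative path, not from the paper
    optim="adamw_torch",            # AdamW
    learning_rate=2e-3,
    weight_decay=0.0,
    num_train_epochs=1,
    warmup_ratio=0.003,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,  # batch size per GPU
    bf16=True,                      # assumption: typical mixed precision on A800s
)

# The 2048 maximum token length is a tokenizer/model setting rather than a
# TrainingArguments field, e.g. tokenizer.model_max_length = 2048.
```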