Emu: Generative Pretraining in Multimodality

Authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models." and "We evaluate Emu on a broad range of vision-language tasks including image captioning (MSCOCO (Chen et al., 2015)), image question answering (VQAv2 (Goyal et al., 2017))..."
Researcher Affiliation | Collaboration | "Quan Sun1 Qiying Yu2,1 Yufeng Cui1 Fan Zhang1 Xiaosong Zhang1 Yueze Wang1 Hongcheng Gao1 Jingjing Liu2 Tiejun Huang1,3 Xinlong Wang1; 1 Beijing Academy of Artificial Intelligence; 2 Tsinghua University; 3 Peking University"
Pseudocode | No | The paper contains architectural diagrams and descriptions but no structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Code & Demo: https://github.com/baaivision/Emu"
Open Datasets | Yes | "We pretrain Emu with web-scale data across modalities in various forms, including image-text pairs (LAION-2B (Schuhmann et al., 2022), LAION-COCO (lai, b)), interleaved images-text data (MMC4 (Zhu et al., 2023b)), video-text pairs (WebVid-10M (Bain et al., 2021)), and our collected interleaved video-text data (YT-Storyboard-1B)." and "LAION-Aesthetics (lai, a) is the subset of LAION-5B (Schuhmann et al., 2022)."
Dataset Splits | Yes | "We evaluate the zero-shot image generation ability on the validation set of MS-COCO (Lin et al., 2014)." and "For each test set sample, we select examples from the training set based on the highest cosine similarity using the extracted features, including them in the prompt." Table 9 also lists OKVQA Val and VisDial Val. (A sketch of this similarity-based example selection appears after the table.)
Hardware Specification | Yes | "We train the model on 128 NVIDIA 80G-A100 GPUs for 10k steps" and "We train the diffusion model with 32 A100-40G GPUs for 15k iterations."
Software Dependencies | No | The paper mentions software components such as LLaMA, Stable Diffusion, and the AdamW optimizer, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | "We train the model on 128 NVIDIA 80G-A100 GPUs for 10k steps with around 82M samples (150B tokens in total), and the pretraining takes approximately 2 days." Table 5 (pretraining hyperparameters of Emu) and Table 7 (Emu visual decoder training hyperparameters) detail learning rates, batch sizes, optimizers, and related settings.
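
The few-shot evaluation protocol quoted under Dataset Splits (picking in-context examples from the training set by highest cosine similarity of extracted features) can be illustrated with a minimal sketch. The function name, feature dimensions, and random stand-in features below are assumptions for illustration, not the paper's implementation; in practice the features would come from a frozen encoder applied to each sample.

import numpy as np

def select_fewshot_examples(test_feature, train_features, k=4):
    # Normalize so that a dot product equals cosine similarity.
    test_unit = test_feature / np.linalg.norm(test_feature)
    train_units = train_features / np.linalg.norm(train_features, axis=1, keepdims=True)
    similarities = train_units @ test_unit
    # Indices of the k most similar training examples, most similar first.
    return np.argsort(-similarities)[:k]

# Hypothetical usage with random stand-in features.
rng = np.random.default_rng(0)
train_features = rng.normal(size=(1000, 768))
test_feature = rng.normal(size=768)
selected = select_fewshot_examples(test_feature, train_features, k=4)
# The selected training examples are then included in the prompt before the test sample.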
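
From the figures quoted under Hardware Specification and Experiment Setup (128 A100-80G GPUs, 10k steps, around 82M samples, 150B tokens, roughly 2 days), a few derived quantities can be sanity-checked. The snippet below only restates those figures and computes the implied per-step and total budgets; the derived values are inferences from this summary, not numbers reported by the paper.

gpus = 128                   # "128 NVIDIA 80G-A100 GPUs"
steps = 10_000               # "10k steps"
samples = 82_000_000         # "around 82M samples"
tokens = 150_000_000_000     # "150B tokens in total"
days = 2                     # "approximately 2 days"

global_batch = samples / steps      # ~8,200 samples per optimizer step (inferred)
tokens_per_step = tokens / steps    # ~15M tokens per step (inferred)
gpu_hours = gpus * days * 24        # ~6,144 A100-80G GPU-hours for pretraining (inferred)

print(f"implied global batch size: ~{global_batch:,.0f} samples/step")
print(f"implied tokens per step:   ~{tokens_per_step:,.0f}")
print(f"approximate compute:       ~{gpu_hours:,} GPU-hours")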