Retrieval-Augmented Multimodal Language Modeling

Authors: Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO) |
| Researcher Affiliation | Collaboration | 1 Stanford University, 2 Meta AI, 3 University of Washington |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks labeled as such. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology. |
| Open Datasets | Yes | To train our model, we use LAION (Schuhmann et al., 2021), an open-sourced dataset that consists of text-image pairs collected from the web. |
| Dataset Splits | Yes | For the main evaluation, we use the standard benchmark, MS-COCO (Lin et al., 2014), to evaluate both text-to-image and image-to-text generation. ... We evaluate our trained model with no further finetuning. ... we generate images for the MS-COCO validation set captions and measure the FID score... we generate captions for the MS-COCO validation set images and measure the CIDEr score... (see the evaluation sketch after the table) |
| Hardware Specification | Yes | The model is trained from scratch for five days on 256 A100 GPUs. |
| Software Dependencies | No | The paper mentions software such as PyTorch, Metaseq, FAISS, and VQGAN with citations but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The sequence length is 4096, which can take up to 3 documents. ... The model is trained from scratch for five days on 256 A100 GPUs. Our implementation is in PyTorch (Paszke et al., 2019) using Metaseq (Zhang et al., 2022). We use model parallelism over 4 GPUs and a batch size of 16 sequences per GPU. The optimization uses a linear learning rate decay with 1500 warmup steps, a peak learning rate of 1e-4, a gradient clipping of 1.0, and the Adam optimizer with β1 = 0.9, β2 = 0.98 (Kingma & Ba, 2015). (see the configuration sketch after the table) |
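The Dataset Splits row describes the paper's MS-COCO validation protocol: FID for text-to-image generation and CIDEr for image-to-text generation. The minimal sketch below shows how those two metrics are commonly computed with the `torchmetrics` and `pycocoevalcap` packages; the random image tensors and toy captions are stand-ins for the actual MS-COCO validation data and the model's generations, and are not taken from the paper.

```python
# Minimal metric-computation sketch for the MS-COCO evaluation described above.
# The image tensors and captions below are random stand-ins, not the paper's data.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from pycocoevalcap.cider.cider import Cider

# --- FID for text-to-image generation ---
# The "real" side would be MS-COCO validation images; the "fake" side would be
# images generated from the validation captions.
real_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # Inception-V3 features; expects uint8 images in [0, 255]
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print("FID:", fid.compute().item())

# --- CIDEr for image-to-text generation ---
# References are the ground-truth captions per image id; hypotheses are the
# generated captions (one per image). pycocoevalcap normally applies PTB
# tokenization first; that step is omitted here for brevity.
references = {
    "img1": ["a dog runs on the beach", "a dog playing near the ocean"],
    "img2": ["two people riding bikes down a street"],
}
hypotheses = {
    "img1": ["a dog running on a beach"],
    "img2": ["two cyclists ride along a road"],
}
score, _ = Cider().compute_score(references, hypotheses)
print("CIDEr:", score)
```

Per the quoted text, the paper reports these metrics zero-shot, i.e., with no further finetuning on MS-COCO.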
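The Experiment Setup row quotes the optimization hyperparameters. The sketch below translates only those quoted settings (Adam with β1 = 0.9, β2 = 0.98, peak learning rate 1e-4, 1500 warmup steps, linear decay, gradient clipping at 1.0) into plain PyTorch. The tiny linear model, the synthetic batches, and the 100,000-step horizon are assumptions for illustration; the paper trains a large CM3 Transformer with Metaseq and model parallelism, and reports wall-clock time rather than a total step count.

```python
# Hedged sketch of the quoted optimization settings in plain PyTorch. The model,
# data, and total_steps are stand-ins; the paper's Metaseq/model-parallel training
# loop is not reproduced here.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                           # stand-in for the RA-CM3 Transformer
batches = [torch.randn(4, 16) for _ in range(10)]   # stand-in for retrieval-augmented sequences

peak_lr, warmup_steps = 1e-4, 1500                  # quoted in the paper
total_steps = 100_000                               # assumed; not stated in the paper

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98))

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR over the first 1500 steps, then linear decay toward zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for batch in batches:
    loss = model(batch).pow(2).mean()               # dummy loss in place of the multimodal LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping of 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

In the paper this configuration runs with model parallelism over 4 GPUs and a per-GPU batch size of 16 sequences of length 4096, which the toy loop above does not attempt to replicate.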