Retrieval-Augmented Multimodal Language Modeling

Authors: Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO) |
| Researcher Affiliation | Collaboration | 1 Stanford University, 2 Meta AI, 3 University of Washington |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks labeled as such. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology. |
| Open Datasets | Yes | To train our model, we use LAION (Schuhmann et al., 2021), an open-sourced dataset that consists of text-image pairs collected from the web. |
| Dataset Splits | Yes | For the main evaluation, we use the standard benchmark, MS-COCO (Lin et al., 2014), to evaluate both text-to-image and image-to-text generation. ... We evaluate our trained model with no further finetuning. ... we generate images for the MS-COCO validation set captions and measure the FID score... we generate captions for the MS-COCO validation set images and measure the CIDEr score... (see the evaluation sketch after the table) |
| Hardware Specification | Yes | The model is trained from scratch for five days on 256 A100 GPUs. |
| Software Dependencies | No | The paper mentions software such as PyTorch, Metaseq, FAISS, and VQGAN with citations but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The sequence length is 4096, which can take up to 3 documents. ... The model is trained from scratch for five days on 256 A100 GPUs. Our implementation is in PyTorch (Paszke et al., 2019) using Metaseq (Zhang et al., 2022). We use model parallelism over 4 GPUs and a batch size of 16 sequences per GPU. The optimization uses a linear learning rate decay with 1500 warmup steps, a peak learning rate of 1e-4, a gradient clipping of 1.0, and the Adam optimizer with β1 = 0.9, β2 = 0.98 (Kingma & Ba, 2015). (see the configuration sketch after the table) |
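The Dataset Splits row describes the paper's MS-COCO validation protocol: FID for text-to-image generation and CIDEr for image-to-text generation. The minimal sketch below shows how those two metrics are commonly computed with the `torchmetrics` and `pycocoevalcap` packages; the random image tensors and toy captions are stand-ins for the actual MS-COCO validation data and the model's generations, and are not taken from the paper.

```python
# Minimal metric-computation sketch for the MS-COCO evaluation described above.
# The image tensors and captions below are random stand-ins, not the paper's data.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from pycocoevalcap.cider.cider import Cider

# --- FID for text-to-image generation ---
# The "real" side would be MS-COCO validation images; the "fake" side would be
# images generated from the validation captions.
real_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # Inception-V3 features; expects uint8 images in [0, 255]
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print("FID:", fid.compute().item())

# --- CIDEr for image-to-text generation ---
# References are the ground-truth captions per image id; hypotheses are the
# generated captions (one per image). pycocoevalcap normally applies PTB
# tokenization first; that step is omitted here for brevity.
references = {
    "img1": ["a dog runs on the beach", "a dog playing near the ocean"],
    "img2": ["two people riding bikes down a street"],
}
hypotheses = {
    "img1": ["a dog running on a beach"],
    "img2": ["two cyclists ride along a road"],
}
score, _ = Cider().compute_score(references, hypotheses)
print("CIDEr:", score)
```

Per the quoted text, the paper reports these metrics zero-shot, i.e., with no further finetuning on MS-COCO.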
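The Experiment Setup row quotes the optimization hyperparameters. The sketch below translates only those quoted settings (Adam with β1 = 0.9, β2 = 0.98, peak learning rate 1e-4, 1500 warmup steps, linear decay, gradient clipping at 1.0) into plain PyTorch. The tiny linear model, the synthetic batches, and the 100,000-step horizon are assumptions for illustration; the paper trains a large CM3 Transformer with Metaseq and model parallelism, and reports wall-clock time rather than a total step count.

```python
# Hedged sketch of the quoted optimization settings in plain PyTorch. The model,
# data, and total_steps are stand-ins; the paper's Metaseq/model-parallel training
# loop is not reproduced here.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                           # stand-in for the RA-CM3 Transformer
batches = [torch.randn(4, 16) for _ in range(10)]   # stand-in for retrieval-augmented sequences

peak_lr, warmup_steps = 1e-4, 1500                  # quoted in the paper
total_steps = 100_000                               # assumed; not stated in the paper

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98))

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR over the first 1500 steps, then linear decay toward zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for batch in batches:
    loss = model(batch).pow(2).mean()               # dummy loss in place of the multimodal LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping of 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

In the paper this configuration runs with model parallelism over 4 GPUs and a per-GPU batch size of 16 sequences of length 4096, which the toy loop above does not attempt to replicate.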