Grounding Language Models to Images for Multimodal Inputs and Outputs

Authors: Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities." (Section 4, Experiments)
Researcher Affiliation | Academia | "Carnegie Mellon University. Correspondence to: Jing Yu Koh <jingyuk@cs.cmu.edu>."
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code and pretrained models are made publicly available to encourage future work and exploration." (https://github.com/kohjingyu/fromage)
Open Datasets | Yes | "We train on the Conceptual Captions (CC3M) dataset (Sharma et al., 2018) consisting of 3.3 million image-text pairs."
Dataset Splits | No | The paper mentions evaluating on the MS-COCO validation set, but does not specify a training/validation/test split for its primary training dataset (CC3M) for hyperparameter tuning or model development.
Hardware Specification | Yes | "Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours)." (see the arithmetic check below the table)
Software Dependencies | Yes | "All models are implemented in PyTorch (Paszke et al., 2019) v1.12 and trained in mixed precision with bfloat16 (Abadi et al., 2016)." (see the mixed-precision sketch below the table)
Experiment Setup | Yes | "Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours). We use the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 0.0003 and warmup of 100 steps. The loss weights λc and λr are set to 1, and we use a visual prefix length of k = 1, retrieval embedding dimension q = 256, and embedding dimension d = 4096 (inherited from OPT-6.7B)." (see the configuration sketch below the table)
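
As a quick sanity check on the Hardware Specification row (our arithmetic, not the paper's code), the reported batch size and iteration count multiply out to roughly one pass over CC3M's ~3.3 million image-text pairs, which is consistent with the quoted "1 epoch":

```python
# Sanity check: batch size x iterations vs. the stated CC3M dataset size.
batch_size = 180
iterations = 18_000
samples_seen = batch_size * iterations          # 3,240,000
cc3m_pairs = 3_300_000                          # approximate CC3M size
print(samples_seen, samples_seen / cc3m_pairs)  # ~0.98 of one epoch
```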
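
The Software Dependencies row quotes PyTorch v1.12 with bfloat16 mixed precision. Below is a minimal sketch of that setup, assuming a standard torch.autocast context; the module and tensor shapes are placeholders, not the authors' model:

```python
import torch

# Minimal bfloat16 mixed-precision sketch for PyTorch >= 1.12 (illustrative only).
model = torch.nn.Linear(4096, 256).cuda()   # placeholder module
x = torch.randn(8, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)              # forward pass computed in bfloat16
    loss = out.float().mean()   # reduce in float32 for stability
loss.backward()                 # bfloat16 needs no gradient scaler
```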
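
The Experiment Setup row lists the training hyperparameters. The sketch below collects them into a hypothetical configuration with an Adam optimizer and a 100-step warmup; the linear warmup shape and the stand-in model are assumptions, not the authors' implementation:

```python
import torch

# Hedged sketch of the quoted hyperparameters; `model` is a placeholder.
config = {
    "batch_size": 180,
    "epochs": 1,                 # 18,000 iterations total
    "lr": 3e-4,
    "warmup_steps": 100,
    "lambda_c": 1.0,             # captioning loss weight
    "lambda_r": 1.0,             # retrieval loss weight
    "visual_prefix_len": 1,      # k
    "retrieval_dim": 256,        # q
    "embedding_dim": 4096,       # d, inherited from OPT-6.7B
}

model = torch.nn.Linear(config["embedding_dim"], config["retrieval_dim"])
optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / config["warmup_steps"]),
)
# Per training step: loss = lambda_c * L_caption + lambda_r * L_retrieval,
# then loss.backward(); optimizer.step(); warmup.step().
```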