Grounding Language Models to Images for Multimodal Inputs and Outputs

Authors: Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities." (Section 4, Experiments)
Researcher Affiliation | Academia | "Carnegie Mellon University. Correspondence to: Jing Yu Koh <jingyuk@cs.cmu.edu>."
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code and pretrained models are made publicly available to encourage future work and exploration." (https://github.com/kohjingyu/fromage)
Open Datasets | Yes | "We train on the Conceptual Captions (CC3M) dataset (Sharma et al., 2018) consisting of 3.3 million image-text pairs."
Dataset Splits | No | The paper mentions evaluating on the MS-COCO validation set, but does not specify a training/validation/test split for its primary training dataset (CC3M) for hyperparameter tuning or model development.
Hardware Specification | Yes | "Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours)." (see the arithmetic check below the table)
Software Dependencies | Yes | "All models are implemented in PyTorch (Paszke et al., 2019) v1.12 and trained in mixed precision with bfloat16 (Abadi et al., 2016)." (see the mixed-precision sketch below the table)
Experiment Setup | Yes | "Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours). We use the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 0.0003 and warmup of 100 steps. The loss weights λc and λr are set to 1, and we use a visual prefix length of k = 1, retrieval embedding dimension q = 256, and embedding dimension d = 4096 (inherited from OPT-6.7B)." (see the configuration sketch below the table)
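
As a quick sanity check on the Hardware Specification row (our arithmetic, not the paper's code), the reported batch size and iteration count multiply out to roughly one pass over CC3M's ~3.3 million image-text pairs, which is consistent with the quoted "1 epoch":

```python
# Sanity check: batch size x iterations vs. the stated CC3M dataset size.
batch_size = 180
iterations = 18_000
samples_seen = batch_size * iterations          # 3,240,000
cc3m_pairs = 3_300_000                          # approximate CC3M size
print(samples_seen, samples_seen / cc3m_pairs)  # ~0.98 of one epoch
```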
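
The Software Dependencies row quotes PyTorch v1.12 with bfloat16 mixed precision. Below is a minimal sketch of that setup, assuming a standard torch.autocast context; the module and tensor shapes are placeholders, not the authors' model:

```python
import torch

# Minimal bfloat16 mixed-precision sketch for PyTorch >= 1.12 (illustrative only).
model = torch.nn.Linear(4096, 256).cuda()   # placeholder module
x = torch.randn(8, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)              # forward pass computed in bfloat16
    loss = out.float().mean()   # reduce in float32 for stability
loss.backward()                 # bfloat16 needs no gradient scaler
```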
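
The Experiment Setup row lists the training hyperparameters. The sketch below collects them into a hypothetical configuration with an Adam optimizer and a 100-step warmup; the linear warmup shape and the stand-in model are assumptions, not the authors' implementation:

```python
import torch

# Hedged sketch of the quoted hyperparameters; `model` is a placeholder.
config = {
    "batch_size": 180,
    "epochs": 1,                 # 18,000 iterations total
    "lr": 3e-4,
    "warmup_steps": 100,
    "lambda_c": 1.0,             # captioning loss weight
    "lambda_r": 1.0,             # retrieval loss weight
    "visual_prefix_len": 1,      # k
    "retrieval_dim": 256,        # q
    "embedding_dim": 4096,       # d, inherited from OPT-6.7B
}

model = torch.nn.Linear(config["embedding_dim"], config["retrieval_dim"])
optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / config["warmup_steps"]),
)
# Per training step: loss = lambda_c * L_caption + lambda_r * L_retrieval,
# then loss.backward(); optimizer.step(); warmup.step().
```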