Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Grounding Language Models to Images for Multimodal Inputs and Outputs
Authors: Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. 4. Experiments |
| Researcher Affiliation | Academia | ¹Carnegie Mellon University. Correspondence to: Jing Yu Koh <EMAIL>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and pretrained models are made publicly available² to encourage future work and exploration. ²https://github.com/kohjingyu/fromage |
| Open Datasets | Yes | We train on the Conceptual Captions (CC3M) dataset (Sharma et al., 2018) consisting of 3.3 million image-text pairs. |
| Dataset Splits | No | The paper mentions evaluating on the MS-COCO validation set, but does not specify a training/validation/test split for its primary training dataset (CC3M) in the context of hyperparameter tuning or model development. |
| Hardware Specification | Yes | Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours). |
| Software Dependencies | Yes | All models are implemented in PyTorch (Paszke et al., 2019) v1.12 and trained mixed-precision with bfloat16 (Abadi et al., 2016). |
| Experiment Setup | Yes | Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours). We use the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 0.0003 and warmup of 100 steps. The loss weights λc and λr are set to 1 and we use a visual prefix length of k = 1 and retrieval embedding dimension q = 256, and embedding dimension d = 4096 (inherited from OPT-6.7B). |
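
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch, which also makes it easy to check that the reported batch size and iteration count are consistent with roughly one pass over CC3M. This is an illustrative sketch only, not the authors' code: the names `TrainConfig`, `lr_at_step`, and `samples_seen` are hypothetical, and the linear warmup followed by a constant rate is an assumption, since the quoted text gives only the warmup length (100 steps) and not its shape or any later decay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (names are hypothetical)."""
    batch_size: int = 180
    iterations: int = 18_000
    learning_rate: float = 3e-4          # Adam learning rate of 0.0003
    warmup_steps: int = 100
    loss_weight_caption: float = 1.0     # lambda_c
    loss_weight_retrieval: float = 1.0   # lambda_r
    visual_prefix_length: int = 1        # k
    retrieval_dim: int = 256             # q
    hidden_dim: int = 4096               # d, inherited from OPT-6.7B


def lr_at_step(cfg: TrainConfig, step: int) -> float:
    """Learning rate at a 0-indexed step.

    ASSUMPTION: linear warmup to the base rate, then constant.
    The source states only that warmup lasts 100 steps.
    """
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate


def samples_seen(cfg: TrainConfig) -> int:
    """Total training examples processed: batch size times iterations."""
    return cfg.batch_size * cfg.iterations


if __name__ == "__main__":
    cfg = TrainConfig()
    # 180 * 18000 = 3,240,000 examples, close to the 3.3 million pairs in CC3M.
    print(samples_seen(cfg))
    print(lr_at_step(cfg, 0), lr_at_step(cfg, 5_000))
```

Under these figures the run processes 3,240,000 examples, slightly under the 3.3 million image-text pairs reported for CC3M, which matches the "1 epoch" description in the quoted setup.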