Grounding Language Models to Images for Multimodal Inputs and Outputs
Authors: Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. (Section 4: Experiments) |
| Researcher Affiliation | Academia | 1Carnegie Mellon University. Correspondence to: Jing Yu Koh <jingyuk@cs.cmu.edu>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and pretrained models are made publicly available2 to encourage future work and exploration. 2https://github.com/kohjingyu/fromage |
| Open Datasets | Yes | We train on the Conceptual Captions (CC3M) dataset (Sharma et al., 2018) consisting of 3.3 million image-text pairs. |
| Dataset Splits | No | The paper mentions evaluating on the MS-COCO validation set, but does not specify a training/validation/test split for its primary training dataset (CC3M) in the context of hyperparameter tuning or model development. |
| Hardware Specification | Yes | Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours). |
| Software Dependencies | Yes | All models are implemented in PyTorch (Paszke et al., 2019) v1.12 and trained mixed-precision with bfloat16 (Abadi et al., 2016). |
| Experiment Setup | Yes | Our models are trained with a batch size of 180 for 1 epoch (18000 iterations) on 1 A6000 GPU (clock time of 24 hours). We use the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 0.0003 and warmup of 100 steps. The loss weights λc and λr are set to 1 and we use a visual prefix length of k = 1 and retrieval embedding dimension q = 256, and embedding dimension d = 4096 (inherited from OPT-6.7B). (See the configuration sketch after the table.) |
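For reference, the reported hyperparameters can be collected into a short PyTorch sketch. This is a minimal illustration assembled from the quoted values, not the authors' released training code (which is at the GitHub link above): the `MappingLayers` module, its placeholder losses, and the dummy batch stand in for the real FROMAGe model and CC3M dataloader, and the linear warmup schedule is an assumption about how the 100 warmup steps are applied.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the trainable mapping layers; the real model keeps OPT-6.7B
# frozen (d = 4096) and trains only small linear layers. Only the numbers are from the paper.
class MappingLayers(nn.Module):
    def __init__(self, d: int = 4096, q: int = 256, k: int = 1):
        super().__init__()
        self.visual_prefix = nn.Linear(d, k * d)   # visual prefix length k = 1 (illustrative)
        self.retrieval_head = nn.Linear(d, q)      # retrieval embedding dimension q = 256

    def forward(self, x: torch.Tensor):
        # Placeholder losses standing in for the captioning and retrieval objectives.
        caption_loss = self.visual_prefix(x).pow(2).mean()
        retrieval_loss = self.retrieval_head(x).pow(2).mean()
        return caption_loss, retrieval_loss

model = MappingLayers()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)   # reported learning rate 0.0003

# Reported schedule: 18000 iterations (1 epoch at batch size 180) with 100 warmup steps;
# linear warmup is an assumption about how those warmup steps are applied.
warmup_steps, total_steps = 100, 18_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

lambda_c, lambda_r = 1.0, 1.0   # reported loss weights

for step in range(total_steps):
    batch = torch.randn(180, 4096)   # dummy batch standing in for CC3M features
    optimizer.zero_grad()
    # bfloat16 mixed precision as reported; autocast is shown as one way to realize it.
    with torch.autocast(device_type="cuda" if torch.cuda.is_available() else "cpu",
                        dtype=torch.bfloat16):
        caption_loss, retrieval_loss = model(batch)
        loss = lambda_c * caption_loss + lambda_r * retrieval_loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```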