Linearly Mapping from Image to Text Space

Authors: Jack Merullo, Louis Castricato, Carsten Eickhoff, Ellie Pavlick

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model).
Researcher Affiliation | Academia | Jack Merullo, Louis Castricato, Carsten Eickhoff, Ellie Pavlick; Department of Computer Science, Brown University, Providence, RI, USA. {jack_merullo, louis_castricato, carsten, ellie_pavlick}@brown.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available here: https://github.com/jmerullo/limber
Open Datasets | Yes | All models are trained with the same basic hyperparameters and settings as described in the MAGMA paper (see Appendix A for details) on the Conceptual Captions 3M dataset (CC3M; Sharma et al., 2018) for 15,000 training steps.
Dataset Splits | Yes | Using the COCO validation set, we count the top 50 nouns, modifiers (e.g., adjectives), and relations (e.g., verbs, prepositional phrases) that appear in the ground-truth captions and calculate how often they appear in the generated captions that were used to calculate the scores in Table 1.
Hardware Specification | Yes | All models are trained for 15,000 training steps across 16 A100 GPUs for approximately 1.75 days.
Software Dependencies | No | The paper mentions the AdamW optimizer and ZeRO stage 2, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | During training, we minimize the loss per mini-batch with the AdamW optimizer (Loshchilov & Hutter, 2018), with the help of ZeRO stage 2 (Rajbhandari et al., 2019). We use a dropout probability of 0.1, a weight decay of 0, betas = (0.9, 0.95), and gradient clipping = 1.0. All models are trained for 15,000 training steps across 16 A100 GPUs for approximately 1.75 days. Our effective batch size was 2048. We use a learning rate of 8 × 10⁻⁴ for the projection layer P. For models where we tune E as well, we tune its parameters with a learning rate of 2 × 10⁻⁶.
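
For reference, the mechanism quoted in the Research Type row (a single linear projection mapping frozen image features into the LM's prompt space) can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed dimensions and module names, not the released LiMBeR code; the `vision_encoder`, `lm`, and dimension values are placeholders.

```python
import torch
import torch.nn as nn

class ImageToPromptProjection(nn.Module):
    """Single linear layer P mapping frozen image-encoder features into the
    frozen LM's input embedding space (dimensions are illustrative only)."""

    def __init__(self, image_feature_dim: int, lm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(image_feature_dim, lm_embed_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, n_image_tokens, image_feature_dim)
        # returns continuous prompts: (batch, n_image_tokens, lm_embed_dim)
        return self.proj(image_features)

# Usage sketch (placeholder names): prepend the projected image tokens to the
# caption's token embeddings and run the frozen LM on the concatenation.
# prompts = projection(vision_encoder(images))           # frozen encoder
# text_embeds = lm.get_input_embeddings()(caption_ids)   # frozen LM embeddings
# inputs = torch.cat([prompts, text_embeds], dim=1)
# logits = lm(inputs_embeds=inputs).logits               # only P receives gradients
```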
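The caption analysis quoted in the Dataset Splits row (counting the top 50 nouns, modifiers, and relations in COCO ground-truth captions and checking how often they occur in generated captions) could be approximated along the lines below. The choice of spaCy, the lemma-based matching, and the coverage metric are assumptions for illustration, not the paper's exact procedure.

```python
from collections import Counter
import spacy  # assumed tagger; the paper does not name its POS tooling

nlp = spacy.load("en_core_web_sm")

def top_k_words(captions, pos_tags, k=50):
    """Return the k most frequent lemmas whose POS tag is in `pos_tags`,
    e.g. {"NOUN"} for nouns, {"ADJ"} for modifiers, {"VERB", "ADP"} for relations."""
    counts = Counter(
        tok.lemma_.lower()
        for caption in captions
        for tok in nlp(caption)
        if tok.pos_ in pos_tags
    )
    return [word for word, _ in counts.most_common(k)]

def coverage(words, generated_captions):
    """Fraction of the reference words appearing in any generated caption
    (one plausible reading of 'how often they appear')."""
    generated_text = " ".join(c.lower() for c in generated_captions)
    return sum(w in generated_text for w in words) / len(words)
```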
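Finally, the hyperparameters quoted in the Experiment Setup row translate fairly directly into a PyTorch optimizer configuration. The sketch below uses placeholder modules for the projection P and image encoder E, and omits the DeepSpeed ZeRO stage 2 wrapping and the dropout inside the LM.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the projection P and image encoder E;
# the dimensions here are illustrative, not the paper's.
projection = nn.Linear(1024, 4096)
# image_encoder = ...  # only tuned in the variants that also train E

optimizer = torch.optim.AdamW(
    [
        {"params": projection.parameters(), "lr": 8e-4},        # lr for P
        # {"params": image_encoder.parameters(), "lr": 2e-6},   # lr for E when tuned
    ],
    betas=(0.9, 0.95),
    weight_decay=0.0,
)

# Per training step (15,000 steps total, effective batch size 2048):
# loss.backward()
# torch.nn.utils.clip_grad_norm_(projection.parameters(), max_norm=1.0)  # clipping = 1.0
# optimizer.step()
# optimizer.zero_grad()
```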