ContextRef: Evaluating Referenceless Metrics for Image Description Generation
Authors: Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. |
| Researcher Affiliation | Academia | Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber; ekreiss@ucla.edu, {ezelikman, cgpotts, nhaber}@stanford.edu |
| Pseudocode | No | The paper includes code snippets in Appendix H to illustrate prompt construction, but these are not presented as formal pseudocode or algorithm blocks for the overall method. |
| Open Source Code | Yes | All data and code are made available at https://github.com/elisakreiss/contextref. |
| Open Datasets | Yes | The data was randomly sampled from the English language subset of the WIT dataset (Srinivasan et al., 2021). |
| Dataset Splits | No | The paper states, "We split the data into an 80% train and 20% test split," but it does not explicitly mention a separate validation set or its proportion. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions various models and their base components (e.g., "Open Flamingo v2," "GPT-2 large," "BLIP-2 variants with Flan-T5 XXL"), but it does not list specific version numbers for underlying software libraries or dependencies like PyTorch, TensorFlow, or CUDA, which are crucial for full reproducibility. |
| Experiment Setup | Yes | We first trained the best-performing CLIP model for 0.5 epochs with a learning rate of 5e-6 and a batch size of 64, with the Adam optimizer (Kingma & Ba, 2014). |
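
The abstract quoted above notes that the benchmark assesses a variety of pretrained models and scoring functions. As a point of reference, the sketch below shows one common kind of referenceless scoring function in this space: a CLIPScore-style cosine similarity between image and description embeddings. The checkpoint name, function name, and scoring details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper evaluates several pretrained models.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore_style(image: Image.Image, description: str) -> float:
    """Referenceless score: cosine similarity of CLIP image and text embeddings."""
    inputs = processor(text=[description], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
```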
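
The Experiment Setup row reports the fine-tuning hyperparameters (CLIP, Adam optimizer, learning rate 5e-6, batch size 64, 0.5 epochs). The following is a minimal sketch of such a setup under stated assumptions: the checkpoint, the `train_dataset` object, and the contrastive training objective are hypothetical placeholders, not the authors' published training code.

```python
import torch
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)  # learning rate from the paper

# `train_dataset` is hypothetical: an iterable of {"image": PIL.Image, "description": str}.
loader = DataLoader(
    train_dataset,
    batch_size=64,  # batch size from the paper
    shuffle=True,
    collate_fn=lambda batch: ([ex["image"] for ex in batch],
                              [ex["description"] for ex in batch]),
)
half_epoch_steps = len(loader) // 2  # "0.5 epochs" as reported

model.train()
for step, (images, texts) in enumerate(loader):
    if step >= half_epoch_steps:
        break
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model(**inputs, return_loss=True)  # CLIP's built-in contrastive loss (assumed objective)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```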