Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
What You See is What You Read? Improving Text-Image Alignment Evaluation
Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments on SeeTRUE, demonstrating that both our VQ2 and VNLI methods outperform a wide range of strong baselines, including various versions of CLIP [15], COCA [22], BLIP [6, 7], and OFA [23]. |
| Researcher Affiliation | Collaboration | Google Research; The Hebrew University of Jerusalem |
| Pseudocode | No | No, the paper describes its methods in text and with diagrams (e.g., Figure 3 for VQ2 pipeline) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and code are attached to this submission. (5) We release our evaluation suite, models and code to foster future work. |
| Open Datasets | Yes | SeeTRUE: a benchmark for image-text alignment encompassing 31,855 real and synthetic image-text pairs from diverse datasets and tasks. ... MS COCO [28]: https://cocodataset.org/#termsofuse ... EditBench [25]: https://research.google/resources/datasets/editbench/, https://www.apache.org/licenses/LICENSE-2.0 ... DrawBench [5]: https://imagen.research.google/, https://docs.google.com/spreadsheets/d/1y7nAbmR4FREi6npB1u-Bo3GFdwdOPYJc617rBOxIRHY/edit#gid=0 ... Pick-a-Pic [29]: https://huggingface.co/datasets/yuvalkirstain/pickapic_v1 ... SNLI-VE [27]: https://github.com/necla-ml/SNLI-VE ... Winoground [24]: https://huggingface.co/datasets/facebook/winoground |
| Dataset Splits | Yes | We note that some of the datasets are only used for testing (e.g., Winoground, DrawBench, EditBench) while others include both training and test sets (e.g., SNLI-VE, COCO t2i, COCO-Con, PickaPic-Con). This allows us to investigate different training configurations and their effect on performance. ... We train the model for two epochs and designate 10% of the training set as a validation set for early stopping and use learning rate selection between {1e-5, 5e-5}. |
| Hardware Specification | Yes | A single training took 5 hours on a linux server with one A6000 GPU. ... Four v4 chips ... 16 TPU v3 cores + 4 v4 chips |
| Software Dependencies | No | No, the paper mentions software/frameworks like "T5-XXL model", "PaLM [35]", "SpaCy [38]", "PaLI-17B [42]", "BLIP2 [7]", "Adam optimizer", "T5X [62]", "JAX [63]", and "Pytorch". However, it does not consistently provide specific version numbers for these components, which is required for a reproducible software description. |
| Experiment Setup | Yes | We train the model for two epochs and designate 10% of the training set as a validation set for early stopping and use learning rate selection between {1e-5, 5e-5}. |
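
The experiment setup quoted above (a 10% validation hold-out and a choice between two learning rates) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names `split_train_validation` and `select_learning_rate`, the random shuffling, the seed, and the `validation_score` callback are all assumptions, since the excerpt does not say how the split was drawn or which metric drove the selection.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_validation(
    examples: Sequence[T],
    validation_fraction: float = 0.1,  # the 10% hold-out quoted above
    seed: int = 0,                     # assumption: the excerpt gives no seed
) -> Tuple[List[T], List[T]]:
    """Randomly hold out a fraction of the training set for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(examples) * validation_fraction))
    val_idx = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val


def select_learning_rate(
    candidates: Sequence[float],
    validation_score: Callable[[float], float],
) -> float:
    """Return the candidate with the highest validation score.

    `validation_score` is a hypothetical callback that trains with the given
    learning rate (for example two epochs with early stopping) and returns a
    score measured on the held-out validation set.
    """
    return max(candidates, key=validation_score)
```

In use, `split_train_validation(training_examples)` would produce the two subsets, and `select_learning_rate([1e-5, 5e-5], validation_score)` would pick between the two rates named in the excerpt once a real training-and-evaluation callback is supplied.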