What You See is What You Read? Improving Text-Image Alignment Evaluation

Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experiments on SeeTRUE, demonstrating that both our VQ2 and VNLI methods outperform a wide range of strong baselines, including various versions of CLIP [15], COCA [22], BLIP [6, 7], and OFA [23].
Researcher Affiliation | Collaboration | Google Research; The Hebrew University of Jerusalem
Pseudocode | No | No, the paper describes its methods in text and with diagrams (e.g., Figure 3 for the VQ2 pipeline) but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code are attached to this submission. ... We release our evaluation suite, models and code to foster future work.
Open Datasets | Yes | SeeTRUE: a benchmark for image-text alignment encompassing 31,855 real and synthetic image-text pairs from diverse datasets and tasks. ... MS COCO [28]: https://cocodataset.org/#termsofuse ... EditBench [25]: https://research.google/resources/datasets/editbench/, https://www.apache.org/licenses/LICENSE-2.0 ... DrawBench [5]: https://imagen.research.google/, https://docs.google.com/spreadsheets/d/1y7nAbmR4FREi6npB1u-Bo3GFdwdOPYJc617rBOxIRHY/edit#gid=0 ... Pick-a-Pic [29]: https://huggingface.co/datasets/yuvalkirstain/pickapic_v1 ... SNLI-VE [27]: https://github.com/necla-ml/SNLI-VE ... Winoground [24]: https://huggingface.co/datasets/facebook/winoground
Dataset Splits | Yes | We note that some of the datasets are only used for testing (e.g., Winoground, DrawBench, EditBench) while others include both training and test sets (e.g., SNLI-VE, COCO t2i, COCO-Con, PickaPic-Con). This allows us to investigate different training configurations and their effect on performance. ... We train the model for two epochs, designate 10% of the training set as a validation set for early stopping, and select the learning rate from {1e-5, 5e-5}.
Hardware Specification | Yes | A single training took 5 hours on a Linux server with one A6000 GPU. ... Four TPU v4 chips ... 16 TPU v3 cores + 4 TPU v4 chips
Software Dependencies | No | No, the paper mentions software/frameworks such as the T5-XXL model, PaLM [35], spaCy [38], PaLI-17B [42], BLIP2 [7], the Adam optimizer, T5X [62], JAX [63], and PyTorch. However, it does not consistently provide specific version numbers for these components, which is required for a reproducible software description.
Experiment Setup | Yes | We train the model for two epochs, designate 10% of the training set as a validation set for early stopping, and select the learning rate from {1e-5, 5e-5}.
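The reported setup (two epochs, 10% of the training set held out for early stopping, learning rate selected from {1e-5, 5e-5}) can be sketched as follows. This is an illustrative sketch only, not the authors' released code; `train_and_eval` is a hypothetical stand-in for the actual fine-tuning routine.

```python
import random

def split_train_validation(examples, val_fraction=0.1, seed=0):
    """Designate a fraction of the training set as a validation set."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_val = int(len(examples) * val_fraction)
    return examples[n_val:], examples[:n_val]

def select_learning_rate(train_set, train_and_eval, lrs=(1e-5, 5e-5)):
    """Train once per candidate learning rate (two epochs each) and
    return the rate whose model scores best on the held-out split."""
    train, val = split_train_validation(train_set)
    scores = {lr: train_and_eval(train, val, lr=lr, epochs=2) for lr in lrs}
    return max(scores, key=scores.get)
```

For a training set of 100 examples, `split_train_validation` yields a 90/10 split; `select_learning_rate` then compares the two candidate rates on the 10% validation slice.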