What You See is What You Read? Improving Text-Image Alignment Evaluation
Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments on SeeTRUE, demonstrating that both our VQ2 and VNLI methods outperform a wide range of strong baselines, including various versions of CLIP [15], CoCa [22], BLIP [6, 7], and OFA [23]. (A hedged CLIP-scoring sketch appears after the table.) |
| Researcher Affiliation | Collaboration | Google Research; The Hebrew University of Jerusalem |
| Pseudocode | No | No, the paper describes its methods in text and with diagrams (e.g., Figure 3 for the VQ2 pipeline) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and code are attached to this submission. (5) We release our evaluation suite, models and code to foster future work. |
| Open Datasets | Yes | SeeTRUE: a benchmark for image-text alignment encompassing 31,855 real and synthetic image-text pairs from diverse datasets and tasks. ... MS COCO [28]: https://cocodataset.org/#termsofuse ... EditBench [25]: https://research.google/resources/datasets/editbench/, https://www.apache.org/licenses/LICENSE-2.0 ... DrawBench [5]: https://imagen.research.google/, https://docs.google.com/spreadsheets/d/1y7nAbmR4FREi6npB1u-Bo3GFdwdOPYJc617rBOxIRHY/edit#gid=0 ... Pick-a-Pic [29]: https://huggingface.co/datasets/yuvalkirstain/pickapic_v1 ... SNLI-VE [27]: https://github.com/necla-ml/SNLI-VE ... Winoground [24]: https://huggingface.co/datasets/facebook/winoground (See the dataset-loading sketch after the table.) |
| Dataset Splits | Yes | We note that some of the datasets are only used for testing (e.g., Winoground, DrawBench, EditBench) while others include both training and test sets (e.g., SNLI-VE, COCO t2i, COCO-Con, PickaPic-Con). This allows us to investigate different training configurations and their effect on performance. ... We train the model for two epochs and designate 10% of the training set as a validation set for early stopping and use learning rate selection between {1e-5, 5e-5}. |
| Hardware Specification | Yes | A single training took 5 hours on a Linux server with one A6000 GPU. ... Four TPU v4 chips ... 16 TPU v3 cores + 4 TPU v4 chips |
| Software Dependencies | No | No, the paper mentions software/frameworks like "T5-XXL model", "PaLM [35]", "spaCy [38]", "PaLI-17B [42]", "BLIP-2 [7]", "Adam optimizer", "T5X [62]", "JAX [63]", and "PyTorch". However, it does not consistently provide specific version numbers for these components, which is required for a reproducible software description. (A version-recording sketch follows the table.) |
| Experiment Setup | Yes | We train the model for two epochs and designate 10% of the training set as a validation set for early stopping and use learning rate selection between {1e-5, 5e-5}. (A minimal training-setup sketch appears after the table.) |
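
The Research Type row quotes CLIP among the baselines that VQ2 and VNLI are compared against. Below is a minimal sketch of a CLIP-style alignment score using the Hugging Face `transformers` CLIP API; the checkpoint `openai/clip-vit-base-patch32` is illustrative and is not necessarily the variant the paper evaluated.

```python
# Hedged sketch of a CLIP-style image-text alignment score, the kind of
# baseline the paper compares VQ2/VNLI against. Checkpoint is illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())
```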
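
Several of the datasets listed in the Open Datasets row are hosted on the Hugging Face Hub. A minimal loading sketch for Winoground follows; the dataset is gated on the Hub, so prior authentication is assumed, and the field names reflect the public dataset card rather than anything stated in the paper.

```python
# Hedged sketch: loading Winoground, one of the SeeTRUE source datasets.
# The dataset is gated, so `huggingface-cli login` (or a token) is assumed
# before this will run.
from datasets import load_dataset

winoground = load_dataset("facebook/winoground", split="test")
example = winoground[0]
# Each example pairs two images with two captions; an aligned model should
# match caption_0 to image_0 and caption_1 to image_1.
print(example["caption_0"])
print(example["caption_1"])
```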
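
The Dataset Splits and Experiment Setup rows report a 90/10 train/validation split, two epochs with early stopping, and learning-rate selection over {1e-5, 5e-5}. The sketch below mirrors that selection loop; `train_and_evaluate` and the placeholder data are hypothetical stand-ins for the paper's VNLI fine-tuning code, not its released implementation.

```python
# Hedged sketch of the reported setup: 10% of the training set held out for
# early stopping, two epochs, and learning-rate selection over {1e-5, 5e-5}.
import random

def train_and_evaluate(train_split, val_split, lr, epochs=2):
    # Hypothetical stand-in: fine-tune with early stopping on val_split and
    # return the best validation score. Replace with the actual training loop.
    ...

pairs = list(range(1000))        # placeholder for SeeTRUE image-text pairs
random.seed(0)
random.shuffle(pairs)
cut = int(0.9 * len(pairs))      # designate 10% as the validation set
train_split, val_split = pairs[:cut], pairs[cut:]

best_lr, best_score = None, float("-inf")
for lr in (1e-5, 5e-5):          # learning-rate selection as reported
    score = train_and_evaluate(train_split, val_split, lr)
    if score is not None and score > best_score:
        best_lr, best_score = lr, score
```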
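
Finally, the Software Dependencies row flags missing version numbers. The sketch below shows one way such versions could be recorded at run time; the package list is an example, not the paper's actual environment.

```python
# Hedged sketch: record exact package versions, the detail the dependencies
# row notes is missing. The package list is illustrative.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("torch", "transformers", "spacy", "datasets"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```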