reproducibilityindex.ai

What You See is What You Read? Improving Text-Image Alignment Evaluation

Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct comprehensive experiments on SeeTRUE, demonstrating that both our VQ2 and VNLI methods outperform a wide range of strong baselines, including various versions of CLIP [15], COCA [22], BLIP [6, 7], and OFA [23].
Researcher Affiliation	Collaboration	Google Research; The Hebrew University of Jerusalem
Pseudocode	No	No, the paper describes its methods in text and with diagrams (e.g., Figure 3 for VQ2 pipeline) but does not include formal pseudocode or algorithm blocks.
Open Source Code	Yes	Data and code are attached to this submission. ... (5) We release our evaluation suite, models and code to foster future work.
Open Datasets	Yes	SeeTRUE: a benchmark for image-text alignment encompassing 31,855 real and synthetic image-text pairs from diverse datasets and tasks. ... MS COCO [28]: https://cocodataset.org/#termsofuse ... EditBench [25]: https://research.google/resources/datasets/editbench/, https://www.apache.org/licenses/LICENSE-2.0 ... DrawBench [5]: https://imagen.research.google/, https://docs.google.com/spreadsheets/d/1y7nAbmR4FREi6npB1u-Bo3GFdwdOPYJc617rBOxIRHY/edit#gid=0 ... Pick-a-Pic [29]: https://huggingface.co/datasets/yuvalkirstain/pickapic_v1 ... SNLI-VE [27]: https://github.com/necla-ml/SNLI-VE ... Winoground [24]: https://huggingface.co/datasets/facebook/winoground
Dataset Splits	Yes	We note that some of the datasets are only used for testing (e.g., Winoground, DrawBench, EditBench) while others include both training and test sets (e.g., SNLI-VE, COCO t2i, COCO-Con, PickaPic-Con). This allows us to investigate different training configurations and their effect on performance. ... We train the model for two epochs and designate 10% of the training set as a validation set for early stopping and use learning rate selection between {1e-5, 5e-5}.
Hardware Specification	Yes	A single training took 5 hours on a Linux server with one A6000 GPU. ... Four v4 chips ... 16 TPU v3 cores + 4 v4 chips
Software Dependencies	No	No, the paper mentions software/frameworks like "T5-XXL model", "PaLM [35]", "SpaCy [38]", "PaLI-17B [42]", "BLIP2 [7]", "Adam optimizer", "T5X [62]", "JAX [63]", and "PyTorch". However, it does not consistently provide specific version numbers for these components, which is required for a reproducible software description.
Experiment Setup	Yes	We train the model for two epochs and designate 10% of the training set as a validation set for early stopping and use learning rate selection between {1e-5, 5e-5}.
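
The setup quoted above (a 10% validation hold-out and a choice between two learning rates) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the function names, the random shuffle, the seed, and the `train_and_eval` callback are all hypothetical. The excerpt gives only the 10% fraction, the two epochs, and the candidate set {1e-5, 5e-5}. Early stopping is left to the caller-supplied callback.

```python
import random


def split_train_validation(examples, validation_fraction=0.1, seed=0):
    """Hold out a fraction of the training examples as a validation set.

    The shuffle and the seed are assumptions; the excerpt only says that
    10% of the training set is designated for validation.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    # Returns (train_set, validation_set).
    return shuffled[n_val:], shuffled[:n_val]


def select_learning_rate(train_and_eval, train_set, val_set,
                         candidates=(1e-5, 5e-5), epochs=2):
    """Train once per candidate learning rate and keep the best one.

    `train_and_eval` is a hypothetical callback that trains a model and
    returns a validation score where higher is better.
    """
    scores = {lr: train_and_eval(train_set, val_set, lr, epochs)
              for lr in candidates}
    best_lr = max(scores, key=scores.get)
    return best_lr, scores
```

A caller would pass its own training routine as `train_and_eval`; any early-stopping logic based on the validation set would live inside that routine.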