VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-training
Authors: Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a range of representative VLP models on VLUE to facilitate future research and analyze their generalization ability and efficiency-performance trade-off with respect to several key design choices. We find that there is a sizable generalization gap for all VLP models when evaluating on new examples annotated with images from in-the-wild distribution. Also, compared to focusing on a single dimension (i.e., absolute performance), measuring the generalization ability of different models can lead to complementary and even controversial conclusions. We also find that models with similar performance may result in completely different positions in the Pareto front measuring the efficiency-performance trade-off of VLP models, which also demonstrates the necessity of a multi-dimension benchmark for evaluating VLP models. |
| Researcher Affiliation | Collaboration | Wangchunshu Zhou*1, Yan Zeng*1, Shizhe Diao*2, Xinsong Zhang*1 (*equal contribution; 1ByteDance AI Lab, 2The Hong Kong University of Science and Technology). Correspondence to: Wangchunshu Zhou <wcszhou@outlook.com>. |
| Pseudocode | No | The paper describes methods and processes in narrative text but does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | The data and codes used for training baseline models are available at https://github.com/MichaelZhouwang/VLUE. |
| Open Datasets | Yes | We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and are practical in terms of efficiency-performance trade-off. The benchmark is publicly available at https://vlue-benchmark.github.io. The data and codes used for training baseline models are available at https://github.com/MichaelZhouwang/VLUE. |
| Dataset Splits | Yes | Table 1 (characteristics of the datasets in VLUE) reports per-task splits; e.g., Image-Text Retrieval on MSCOCO (image domain: COCO) has 566,747 train, 25,010 dev, 25,010 test, and 27,796 OOD test examples, evaluated with R@1. |
| Hardware Specification | Yes | We fix the hardware environment to 1 Nvidia Tesla V100 GPU and the batch size to 1 to simulate real application scenarios. (Footnote: The actual inference time of different models depends on hardware.) |
| Software Dependencies | No | The paper mentions various VLP models and general concepts like transformers and convolutional networks but does not provide specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'TensorFlow 2.x') required for replication. |
| Experiment Setup | Yes | For each model, we fine-tune the released pre-trained checkpoint on the VLUE tasks with the hyperparameters provided in the paper. We only consider tasks for which the original paper reported results. ... After fine-tuning, we evaluate the performance of fine-tuned models on the corresponding OOD test sets in the zero-shot fashion. In addition to the absolute performance, we also record the actual inference time of different models in a controlled setting where the hardware environment is fixed for all models. ... We fix the hardware environment to 1 Nvidia Tesla V100 GPU and the batch size to 1 to simulate real application scenarios. |
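The sketches below illustrate a few of the measurements referenced in the table. First, the Research Type row mentions a Pareto front over the efficiency-performance trade-off of VLP models. This is a minimal sketch of computing such a front from (latency, score) pairs; the model names and numbers are placeholders, not results from the paper.

```python
# Sketch: Pareto front over (inference time, task score) pairs, in the spirit
# of the paper's efficiency-performance trade-off analysis. All measurements
# below are hypothetical placeholders.

def pareto_front(points):
    """Return the (name, latency, score) points not dominated by any other.

    A point dominates another if it is at least as fast AND at least as
    accurate, and strictly better on one of the two axes.
    """
    front = []
    for name, latency, score in points:
        dominated = any(
            (l2 <= latency and s2 >= score) and (l2 < latency or s2 > score)
            for _, l2, s2 in points
        )
        if not dominated:
            front.append((name, latency, score))
    # Sort by latency so the front reads as a trade-off curve.
    return sorted(front, key=lambda p: p[1])

# Placeholder measurements: (model, seconds per example, metric score).
models = [
    ("model_a", 0.12, 74.1),
    ("model_b", 0.35, 76.8),
    ("model_c", 0.40, 75.0),  # dominated by model_b (slower and less accurate)
]
print(pareto_front(models))  # [("model_a", ...), ("model_b", ...)]
```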
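The Dataset Splits row cites R@1 as the retrieval metric. Below is a minimal sketch of Recall@1 for image-to-text retrieval; `sim` and `gt` are assumed inputs (a similarity matrix from any VLP model, and per-image ground-truth caption indices), not artifacts of the authors' code.

```python
# Sketch: Recall@1 for image-to-text retrieval, the metric reported for the
# MSCOCO row in Table 1. `sim` is a hypothetical (num_images, num_texts)
# similarity matrix; `gt` maps each image to the indices of its ground-truth
# captions (MSCOCO has several captions per image).

import numpy as np

def recall_at_1(sim: np.ndarray, gt: list[set[int]]) -> float:
    """Fraction of images whose top-ranked text is a ground-truth caption."""
    top1 = sim.argmax(axis=1)
    hits = sum(int(top1[i] in gt[i]) for i in range(len(gt)))
    return hits / len(gt)

# Toy example: 2 images, 4 candidate texts.
sim = np.array([[0.9, 0.1, 0.2, 0.0],
                [0.3, 0.2, 0.8, 0.1]])
gt = [{0, 1}, {2, 3}]
print(recall_at_1(sim, gt))  # 1.0 -- both images rank a correct caption first
```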
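The Hardware Specification row describes timing inference on a single V100 with batch size 1. A plausible measurement loop in PyTorch is sketched below; `model` and `example` are hypothetical stand-ins for a fine-tuned VLP model and one preprocessed input, and, as the paper's footnote notes, absolute numbers depend on hardware.

```python
# Sketch: per-example inference latency at batch size 1 on one GPU, assuming
# a PyTorch model. Warm-up and explicit synchronization avoid measuring CUDA
# initialization or still-queued kernels.

import time
import torch

@torch.no_grad()
def time_inference(model, example, device="cuda", warmup=10, iters=100):
    model.eval().to(device)
    example = example.to(device)
    for _ in range(warmup):          # warm-up runs amortize one-time costs
        model(example)
    torch.cuda.synchronize(device)   # ensure queued kernels finished
    start = time.perf_counter()
    for _ in range(iters):
        model(example)
    torch.cuda.synchronize(device)   # wait for the last kernel before stopping
    return (time.perf_counter() - start) / iters  # seconds per example
```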
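Finally, the Experiment Setup row describes evaluating fine-tuned models zero-shot on the OOD test sets. The generalization gap the paper reports can be summarized as the drop from the standard test set to the OOD test set; `evaluate` below is a hypothetical helper returning the task metric, not the authors' API.

```python
# Sketch: the generalization gap -- in-distribution test score minus the
# zero-shot score on the OOD test set (no further fine-tuning on OOD data).

def generalization_gap(model, test_set, ood_test_set, evaluate) -> dict:
    in_dist = evaluate(model, test_set)   # standard (in-distribution) test
    ood = evaluate(model, ood_test_set)   # OOD test, evaluated zero-shot
    return {"test": in_dist, "ood_test": ood, "gap": in_dist - ood}
```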