Large Language Models as Automated Aligners for benchmarking Vision-Language Models
Authors: Yuanfeng Ji, Chongjian GE, Weikai Kong, Enze Xie, Zhengying Liu, Zhenguo Li, Ping Luo
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating the evolving sophisticated VLMs. |
| Researcher Affiliation | Collaboration | ¹The University of Hong Kong, ²Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes its processes and pipeline in textual form but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states only that "Data and code will be released"; no repository link is provided. |
| Open Datasets | Yes | Specifically, we obtain COCO (Lin et al., 2014) images and their associated captions, instances, relations, and text annotations from its extended datasets (Chen et al., 2015; Lin et al., 2014; Yang et al., 2022; Veit et al., 2016). |
| Dataset Splits | Yes | To the best of our knowledge, Auto-Bench represents the most extensive known collection of its kind. ...we employed a crowdsourcing approach to carefully select about 28.5K high quality samples to form a validation dataset, which is then used for performance evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions specific models and APIs such as GPT-4, GPT-3.5 Turbo, and SimCSE, but does not specify version numbers for any software dependencies or libraries used in its implementation. |
| Experiment Setup | Yes | The training configurations employed in the instruction-tuning stage of MiniGPT-4 were followed, with 5 epochs of SFT. |
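
The 85% figure quoted in the Research Type row is an average agreement rate between LLM-issued verdicts and human judgments. Below is a minimal sketch of how such a rate is computed; the function name and sample labels are hypothetical illustrations, not the authors' code.

```python
# Minimal sketch: agreement rate between LLM-issued verdicts and human
# judgments. The paper reports ~85% on average; the data here is made up.

def agreement_rate(llm_verdicts: list[str], human_verdicts: list[str]) -> float:
    """Fraction of samples where the LLM judge and human annotators agree."""
    assert len(llm_verdicts) == len(human_verdicts)
    matches = sum(l == h for l, h in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)

# Hypothetical usage: per-sample verdicts such as "correct"/"incorrect".
llm = ["correct", "correct", "incorrect", "correct"]
human = ["correct", "incorrect", "incorrect", "correct"]
print(f"Agreement: {agreement_rate(llm, human):.0%}")  # -> Agreement: 75%
```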
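The Open Datasets row cites COCO and its extended annotation sets as the curation source. As a sketch of the kind of data access involved, the snippet below pulls per-image captions with the standard pycocotools API, assuming the public 2014 caption annotations; the file path is a placeholder, and the paper's actual curation prompts are not reproduced here.

```python
# Minimal sketch of loading COCO captions for evaluation-data curation,
# using the standard pycocotools API. The annotation path is a placeholder.
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_train2014.json")  # hypothetical path

image_id = coco_caps.getImgIds()[0]
ann_ids = coco_caps.getAnnIds(imgIds=[image_id])
captions = [ann["caption"] for ann in coco_caps.loadAnns(ann_ids)]
# In the paper's pipeline, annotations like these are fed to an LLM prompt
# that generates question-answer pairs; that prompting step is not shown.
print(captions)
```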
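The Software Dependencies row notes that SimCSE is named without a version. A minimal sketch of answer matching with SimCSE sentence embeddings follows, assuming the public Hugging Face checkpoint `princeton-nlp/sup-simcse-roberta-large`; the paper does not specify which checkpoint or library version was used.

```python
# Minimal sketch: semantic similarity between a predicted and a reference
# answer using SimCSE embeddings. Checkpoint choice is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

name = "princeton-nlp/sup-simcse-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # SimCSE uses the [CLS] pooler output as the sentence embedding.
        return model(**batch).pooler_output

predicted, reference = embed(["a man riding a horse", "a person on horseback"])
similarity = torch.cosine_similarity(predicted, reference, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```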