I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
Authors: Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, Rongrong Ji
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions. It offers three distinctive characteristics: 1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation dimensions that cover both high-level and low-level aspects, providing a comprehensive assessment of each IIE model. 2) Human Perception Alignment: To ensure the alignment of our benchmark with human perception, we conducted an extensive user study for each evaluation dimension. |
| Researcher Affiliation | Collaboration | Yiwei Ma¹, Jiayi Ji¹, Ke Ye¹, Weihuang Lin¹, Zhibin Wang², Yonghan Zheng¹, Qiang Zhou², Xiaoshuai Sun¹, Rongrong Ji¹. ¹Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. ²Inf Tech Company, Hangzhou, 310000, P.R. China. |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models. The code, dataset, and generated images from all IIE models are provided on GitHub: https://github.com/cocoshe/I2EBench. |
| Open Datasets | Yes | We meticulously curated approximately 140 images from publicly available datasets (Lin et al. [2014], Guo et al. [2023b], Martin et al. [2001], Chen et al. [2021], Ancuti et al. [2019], Liu et al. [2021b,a], Qu et al. [2017], Nah et al. [2017], Shen et al. [2019], Wei et al. [2018]) for each evaluation dimension of I2EBench. |
| Dataset Splits | No | The paper describes the benchmark's construction and its human annotation and evaluation process, but it does not specify training, validation, or test splits, since I2EBench is an evaluation benchmark rather than a model being trained. |
| Hardware Specification | No | The paper states in its NeurIPS checklist that 'We did not train the IIE model, and all checkpoints are sourced from the official code, so there is no need to report them.' It does not specify the hardware used for running the evaluation or analysis presented in the paper. |
| Software Dependencies | No | The paper mentions utilizing 'official codes from various models for image editing' and using 'ChatGPT (Achiam et al. [2023])' and the 'GPT-4V model' for evaluations, but it does not specify version numbers for any software libraries, frameworks, or dependencies used in its own experimental setup or evaluation pipeline. |
| Experiment Setup | No | The paper describes the evaluation methodology for each dimension (e.g., using GPT-4V, CLIP similarity, SSIM; see the metric sketch after this table) and details the human evaluation process. However, it does not provide specific hyperparameters or system-level training settings for its own benchmark evaluation process or the models it uses, stating that it uses 'official codes from various models for image editing' without detailing their configurations. |
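
The Experiment Setup row names CLIP similarity and SSIM among the automatic metrics. The sketch below illustrates how such per-image scores are typically computed; it is not the benchmark's official evaluation script (that lives in the GitHub repository above), and the checkpoint choice (`openai/clip-vit-base-patch32`), the function names, the fixed 256×256 grayscale resize, and the file paths are assumptions for illustration only.

```python
# Minimal sketch of the two automatic metrics named in the paper's evaluation
# methodology: CLIP image-text similarity (instruction adherence) and SSIM
# (content preservation). This is NOT the official I2EBench script; checkpoint,
# function names, resize, and file paths below are illustrative assumptions.

import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity as compare_ssim
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def clip_similarity(edited: Image.Image, instruction: str) -> float:
    """Cosine similarity between the edited image and the edit instruction."""
    inputs = processor(text=[instruction], images=edited,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def ssim_score(original: Image.Image, edited: Image.Image) -> float:
    """Structural similarity between source and edited images (grayscale)."""
    a = np.asarray(original.convert("L").resize((256, 256)))
    b = np.asarray(edited.convert("L").resize((256, 256)))
    return float(compare_ssim(a, b))

if __name__ == "__main__":
    src = Image.open("original.png")  # hypothetical input/output pair
    out = Image.open("edited.png")
    print("CLIP similarity:", clip_similarity(out, "make the sky sunset orange"))
    print("SSIM:", ssim_score(src, out))
```

A dimension-level score would aggregate such per-image values across the benchmark's images for that dimension; for reproducing the paper's reported numbers, the official script in the repository should be used instead.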