I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

Authors: Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, Rongrong Ji

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions. It offers three distinctive characteristics: 1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation dimensions that cover both high-level and low-level aspects, providing a comprehensive assessment of each IIE model. 2) Human Perception Alignment: To ensure the alignment of our benchmark with human perception, we conducted an extensive user study for each evaluation dimension.
Researcher Affiliation | Collaboration | Yiwei Ma (1), Jiayi Ji (1), Ke Ye (1), Weihuang Lin (1), Zhibin Wang (2), Yonghan Zheng (1), Qiang Zhou (2), Xiaoshuai Sun (1), Rongrong Ji (1). (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. (2) Inf Tech Company, Hangzhou, 310000, P.R. China.
Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models. The code, dataset, and generated images from all IIE models are provided on GitHub: https://github.com/cocoshe/I2EBench (see the hypothetical harness sketch after this table).
Open Datasets | Yes | We meticulously curated approximately 140 images from publicly available datasets (Lin et al. [2014], Guo et al. [2023b], Martin et al. [2001], Chen et al. [2021], Ancuti et al. [2019], Liu et al. [2021b,a], Qu et al. [2017], Nah et al. [2017], Shen et al. [2019], Wei et al. [2018]) for each evaluation dimension of I2EBench.
Dataset Splits | No | The paper describes its benchmark and the human annotation and evaluation process, but it does not specify training, validation, or test splits for its own data, as it is a benchmark rather than a model being trained.
Hardware Specification | No | The paper states in its NeurIPS checklist that 'We did not train the IIE model, and all checkpoints are sourced from the official code, so there is no need to report them.' It does not specify the hardware used for running the evaluation or analysis presented in the paper.
Software Dependencies | No | The paper mentions utilizing 'official codes from various models for image editing' and using ChatGPT (Achiam et al. [2023]) and the GPT-4V model for evaluations, but it does not specify version numbers for any software libraries, frameworks, or dependencies used in the experimental setup or evaluation pipeline.
Experiment Setup | No | The paper describes the evaluation methodology for each dimension (e.g., using GPT-4V, CLIP similarity, SSIM; a minimal sketch of the latter two metrics follows this table) and details the human evaluation process. However, it does not provide specific hyperparameters or system-level training settings for its own benchmark evaluation process or for the models it evaluates, stating that it uses 'official codes from various models for image editing' without detailing their configurations.
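
For reference, the CLIP-similarity and SSIM metrics named in the Experiment Setup row are standard and straightforward to reproduce. The sketch below is not the official I2EBench evaluation script; the CLIP checkpoint name and file paths are illustrative assumptions.

```python
# Minimal sketch of two automatic metrics the benchmark reportedly uses:
# CLIP similarity (instruction following) and SSIM (structural fidelity).
# NOT the official I2EBench script; checkpoint and paths are assumptions.
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path: str, instruction: str) -> float:
    """Cosine similarity between an edited image and its edit instruction."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[instruction], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def ssim_score(original_path: str, edited_path: str) -> float:
    """Grayscale SSIM between the original and the edited image."""
    original = Image.open(original_path).convert("L")
    edited = Image.open(edited_path).convert("L").resize(original.size)
    return structural_similarity(np.array(original), np.array(edited),
                                 data_range=255)
```

In general, a higher clip_similarity suggests the edit better matches the instruction, while a higher ssim_score indicates the edit preserved more of the original image's structure.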
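
The "simple script for evaluating the results from new IIE models" mentioned under Open Source Code is not documented in this report. Purely as an illustration of what such a harness might look like, the hypothetical loop below averages a per-sample metric (reusing clip_similarity from the previous sketch) over each evaluation dimension; the directory layout and annotation schema are assumptions, not the repository's actual interface.

```python
# Hypothetical harness in the spirit of the repository's "simple script".
# Directory layout, JSON schema, and the choice of metric are assumptions.
import json
from collections import defaultdict
from pathlib import Path

def evaluate_model(edited_root: str, annotations_file: str) -> dict:
    """Average CLIP similarity per evaluation dimension for one IIE model."""
    with open(annotations_file) as f:
        # Assumed schema: [{"id": ..., "dimension": ..., "instruction": ...}]
        samples = json.load(f)
    per_dim = defaultdict(list)
    for s in samples:
        edited = Path(edited_root) / s["dimension"] / f"{s['id']}.png"
        if not edited.exists():
            continue  # skip samples the model failed to produce
        per_dim[s["dimension"]].append(
            clip_similarity(str(edited), s["instruction"]))
    return {dim: sum(v) / len(v) for dim, v in per_dim.items() if v}

# Example: scores = evaluate_model("outputs/some_iie_model", "annotations.json")
```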