ImagenHub: Standardizing the evaluation of conditional image generation models
Authors: Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, Wenhu Chen
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper proposes ImagenHub, a one-stop library to standardize the inference and evaluation of all the conditional image generation models. Firstly, we define seven prominent tasks and curate high-quality evaluation datasets for them. Secondly, we build a unified inference pipeline to ensure fair comparison. Thirdly, we design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves a high inter-worker agreement, with Krippendorff's alpha above 0.4 on 76% of the models. We comprehensively evaluated a total of around 30 models and observed three key takeaways... |
| Researcher Affiliation | Academia | University of Waterloo, Ohio State University, University of California, Santa Barbara, University of Pennsylvania, Central South University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://tiger-ai-lab.github.io/ImagenHub/ and the ImagenHub Inference Library. We built an ImagenHub library to evaluate all the conditional image generation models. |
| Open Datasets | Yes | We hosted all of our datasets on the Hugging Face dataset for easy access and maintenance. |
| Dataset Splits | No | The paper describes the creation and use of the ImagenHub dataset for evaluation but does not specify any training, validation, or test splits of this dataset for their own experimental process, as they are evaluating pre-trained models rather than training new ones. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments. |
| Software Dependencies | No | The paper mentions software like 'Huggingface libraries' and 'Stable Diffusion' but does not provide specific version numbers for key software components like Python, PyTorch, or specific libraries used in their evaluation framework. |
| Experiment Setup | Yes | All the models either used the default setting from the official implementation or the setting suggested in Hugging Face documentation (von Platen et al., 2022). We disabled negative prompts and any prompt engineering tricks to ensure a fair comparison. We conducted human evaluation by recruiting participants from Prolific to rate the images, and our own researchers also took part in the image rating process. We assigned 3 raters for each model and computed the SC score, PQ score, and Overall human score. We set the overall score of the model as the geometric mean of the SC and PQ scores (i.e., O = √(SC × PQ)). |
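
As a rough illustration of the scoring scheme quoted above, the sketch below aggregates per-rater Semantic Consistency (SC) and Perceptual Quality (PQ) ratings into per-model scores, combines them as the geometric mean O = √(SC × PQ), and checks inter-rater agreement with Krippendorff's alpha. The rating scale, the toy data, and the use of the third-party `krippendorff` package are assumptions made for illustration, not the authors' released evaluation code.

```python
import math

import numpy as np
import krippendorff  # third-party: `pip install krippendorff` (assumed dependency)

# Hypothetical ratings from 3 raters over a few images of a single model.
# The {0, 0.5, 1} rating scale and the toy values are illustrative assumptions.
sc_ratings = np.array([  # Semantic Consistency: rows = images, columns = raters
    [1.0, 0.5, 1.0],
    [0.5, 0.5, 1.0],
    [1.0, 1.0, 1.0],
])
pq_ratings = np.array([  # Perceptual Quality: same layout
    [0.5, 0.5, 1.0],
    [1.0, 0.5, 0.5],
    [1.0, 1.0, 0.5],
])

def model_score(ratings: np.ndarray) -> float:
    """Average each image over its raters, then average over all images."""
    return float(ratings.mean(axis=1).mean())

sc = model_score(sc_ratings)   # per-model SC score
pq = model_score(pq_ratings)   # per-model PQ score
overall = math.sqrt(sc * pq)   # O = sqrt(SC * PQ), geometric mean of the two scores

# Inter-rater agreement: krippendorff.alpha expects reliability_data shaped
# (raters, items), so the (images, raters) matrix is transposed first.
# The "interval" measurement level is an assumption made for this sketch.
alpha_sc = krippendorff.alpha(reliability_data=sc_ratings.T,
                              level_of_measurement="interval")

print(f"SC={sc:.3f}  PQ={pq:.3f}  Overall={overall:.3f}  alpha_SC={alpha_sc:.3f}")
```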