RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

Authors: Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
Researcher Affiliation Collaboration Xinchen Zhang (1), Ling Yang (2), Yaqi Cai (3), Zhaochen Yu (2), Kai-Ni Wang (4), Jiake Xie (5), Ye Tian (2), Minkai Xu (6), Yong Tang (5), Yujiu Yang (1), Bin Cui (2) — 1 Tsinghua University, 2 Peking University, 3 University of Science and Technology of China, 4 Southeast University, 5 LibAI Lab, 6 Stanford University
Pseudocode Yes Algorithm 1 Compositional denoising procedure of layout-based RealCompo
Open Source Code Yes https://github.com/YangLing0818/RealCompo
Open Datasets Yes To evaluate compositionality, we compare our RealCompo with the outstanding T2I and L2I models on T2I-CompBench [21]. This benchmark tests models on attribute binding, object relationships, numeracy, and complexity. To evaluate realism, we randomly select 3K text prompts from the COCO validation set.
Dataset Splits No The paper mentions using a 'COCO validation set' for evaluation, but this refers to a dataset used for testing, not a train/validation/test split for training their own model. The models used (SD v1.5, GLIGEN, SDXL, ControlNet) are pre-trained. The paper does not specify how data was split for their specific framework's development or tuning, only for evaluation against benchmarks.
Hardware Specification Yes All of our experiments are conducted under 1 NVIDIA 80G-A100 GPU.
Software Dependencies No We selected GPT-4 [1] as the layout generator in our experiments... The paper mentions GPT-4 but does not specify its version or the versions of any other software dependencies (e.g., Python, PyTorch, CUDA) required to replicate the experiments.
Experiment Setup Yes Implementation Details Our RealCompo is a generic, scalable framework that can achieve the complementary advantages of any chosen (stylized) T2I models and spatial-aware image diffusion models. We selected GPT-4 [1] as the layout generator in our experiments; the detailed rules are described in Appendix C.1. For layout-based RealCompo, we chose SD v1.5 [41] and GLIGEN [27] as the backbone. For keypoint-based RealCompo, we chose SDXL [4] and ControlNet [72] as the backbone. For segmentation-based RealCompo, we chose SD v2.1 [41] and ControlNet [72] as the backbone. For style-based RealCompo, we chose two stylized T2I models, Coloring Page Diffusion and Cute Yuki Mix, as the backbone, and chose GLIGEN [27] as the backbone of the L2I model. All of our experiments are conducted under 1 NVIDIA 80G-A100 GPU.
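The core of the setup above is pairing a T2I backbone with a spatial-aware backbone inside one denoising loop (Algorithm 1 in the paper). A minimal numpy sketch of that balancing idea is shown below; the function name `balanced_denoise_step`, the coefficient names, and the per-pixel softmax normalization are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def balanced_denoise_step(eps_t2i, eps_l2i, xi_t2i, xi_l2i):
    """Combine the noise predictions of a T2I model and a spatial-aware
    (e.g. L2I) model at one denoising step.

    eps_t2i, eps_l2i: per-pixel predicted noise from the two backbones.
    xi_t2i, xi_l2i: per-pixel balancing coefficients (in RealCompo these
    are updated dynamically during sampling; here they are just inputs).
    """
    # Softmax over the two models, per pixel, so the weights sum to 1.
    xs = np.stack([xi_t2i, xi_l2i])
    weights = np.exp(xs) / np.exp(xs).sum(axis=0)
    # Weighted combination of the two noise predictions.
    return weights[0] * eps_t2i + weights[1] * eps_l2i

# Toy usage: equal coefficients give an even blend of the two predictions.
rng = np.random.default_rng(0)
eps_a = rng.standard_normal((4, 4))
eps_b = rng.standard_normal((4, 4))
zeros = np.zeros((4, 4))
eps_combined = balanced_denoise_step(eps_a, eps_b, zeros, zeros)
```

In the actual method the coefficients are influence estimates derived from cross-attention maps and are refreshed at every step; this sketch only shows how the two backbones' outputs are merged once the coefficients are in hand.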