RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models
Authors: Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while maintaining satisfactory realism and compositionality in the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. |
| Researcher Affiliation | Collaboration | Xinchen Zhang¹, Ling Yang², Yaqi Cai³, Zhaochen Yu², Kai-Ni Wang⁴, Jiake Xie⁵, Ye Tian², Minkai Xu⁶, Yong Tang⁵, Yujiu Yang¹, Bin Cui² (¹ Tsinghua University, ² Peking University, ³ University of Science and Technology of China, ⁴ Southeast University, ⁵ LibAI Lab, ⁶ Stanford University) |
| Pseudocode | Yes | Algorithm 1: Compositional denoising procedure of layout-based RealCompo (a hedged sketch of this loop follows the table). |
| Open Source Code | Yes | https://github.com/YangLing0818/RealCompo |
| Open Datasets | Yes | To evaluate compositionality, we compare our RealCompo with the outstanding T2I and L2I models on T2I-CompBench [21]. This benchmark tests models across the aspects of attribute binding, object relationships, numeracy, and complexity. To evaluate realism, we randomly select 3K text prompts from the COCO validation set. (An FID sketch for this evaluation follows the table.) |
| Dataset Splits | No | The paper mentions using a 'COCO validation set' for evaluation, but this refers to a dataset used for testing, not a train/validation/test split for training their own model. The models used (SD v1.5, GLIGEN, SDXL, ControlNet) are pre-trained. The paper does not specify how data was split for their specific framework's development or tuning, only for evaluation against benchmarks. |
| Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA A100 (80GB) GPU. |
| Software Dependencies | No | We selected GPT-4 [1] as the layout generator in our experiments... The paper mentions GPT-4 but does not specify its version, nor the versions of any other software dependencies (e.g., Python, PyTorch, CUDA) required to replicate the experiments. (A version-logging sketch follows the table.) |
| Experiment Setup | Yes | Implementation Details: Our RealCompo is a generic, scalable framework that combines the complementary advantages of any chosen (stylized) T2I model and spatial-aware image diffusion model. We selected GPT-4 [1] as the layout generator in our experiments; the detailed rules are described in Appendix C.1. For layout-based RealCompo, we chose SD v1.5 [41] and GLIGEN [27] as the backbone. For keypoint-based RealCompo, we chose SDXL [4] and ControlNet [72] as the backbone. For segmentation-based RealCompo, we chose SD v2.1 [41] and ControlNet [72] as the backbone. For style-based RealCompo, we chose two stylized T2I models, Coloring Page Diffusion and CuteYukiMix, as the backbone, and chose GLIGEN [27] as the backbone of the L2I model. All of our experiments are conducted on a single NVIDIA A100 (80GB) GPU. (A setup sketch for the layout-based pair follows the table.) |
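
The Pseudocode row refers to Algorithm 1, the compositional denoising procedure of layout-based RealCompo: at every denoising step a T2I UNet and an L2I UNet each predict noise, and a softmax-normalized pair of balancing coefficients blends the two predictions before the scheduler step. The sketch below is a minimal rendering of that loop under stated assumptions; the names `t2i_unet`, `l2i_unet`, `update_coefficients`, and the conditioning tensors are illustrative, not the authors' actual API.

```python
# Minimal sketch of a RealCompo-style compositional denoising loop.
# The scheduler/UNet interfaces follow the diffusers conventions
# (scheduler.step(...).prev_sample, unet(...).sample).
import torch

def update_coefficients(c_t2i, c_l2i, eps_t2i, eps_l2i):
    # Placeholder: RealCompo updates the balancing coefficients
    # dynamically from cross-attention maps; here they stay fixed.
    return c_t2i, c_l2i

@torch.no_grad()
def compositional_denoise(latents, text_emb, layout_emb,
                          t2i_unet, l2i_unet, scheduler, num_steps=50):
    """Blend T2I (realism) and L2I (compositionality) noise predictions."""
    scheduler.set_timesteps(num_steps)
    c_t2i, c_l2i = torch.tensor(0.5), torch.tensor(0.5)  # even initial split
    for t in scheduler.timesteps:
        # Each backbone predicts noise from its own conditioning signal.
        eps_t2i = t2i_unet(latents, t, encoder_hidden_states=text_emb).sample
        eps_l2i = l2i_unet(latents, t, encoder_hidden_states=layout_emb).sample
        # Softmax-normalized blend of the two predictions.
        w = torch.softmax(torch.stack([c_t2i, c_l2i]), dim=0)
        eps = w[0] * eps_t2i + w[1] * eps_l2i
        latents = scheduler.step(eps, t, latents).prev_sample
        c_t2i, c_l2i = update_coefficients(c_t2i, c_l2i, eps_t2i, eps_l2i)
    return latents
```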
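
The realism metric quoted in the Open Datasets row is computed over images generated from 3K randomly sampled COCO validation captions. Below is a minimal sketch of such an FID computation; torchmetrics is an assumed tooling choice, since the paper does not name its FID implementation.

```python
# FID between real COCO images and images generated from their captions.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_images: torch.Tensor, generated_images: torch.Tensor) -> float:
    """Both inputs: uint8 tensors of shape (N, 3, H, W) with values in [0, 255]."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    return fid.compute().item()
```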
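
Because the Software Dependencies row finds no pinned versions, a reproduction would at minimum need to record the environment it actually ran under. The snippet below is one way to do that; it is a suggestion for reproducers, not something the paper provides.

```python
# Record the versions of the core dependencies for a reproducibility log.
import sys
import torch
import diffusers

print("python   :", sys.version.split()[0])
print("torch    :", torch.__version__)
print("cuda     :", torch.version.cuda)
print("diffusers:", diffusers.__version__)
```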
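
The Experiment Setup row names the backbone pairs but no loading code. The sketch below shows how the layout-based pair (SD v1.5 + GLIGEN) could be instantiated with diffusers and how GPT-4 could be queried for a layout. The checkpoint IDs and the layout instruction are assumptions; the paper's actual GPT-4 layout-generation rules are in its Appendix C.1.

```python
# Hypothetical setup for layout-based RealCompo: SD v1.5 (T2I) + GLIGEN (L2I),
# with GPT-4 as the layout generator.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionGLIGENPipeline
from openai import OpenAI

device = "cuda"
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to(device)
l2i = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-text-box", torch_dtype=torch.float16).to(device)

def generate_layout(prompt: str) -> str:
    """Ask GPT-4 for per-object bounding boxes (hypothetical instruction)."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": "Give a bounding box [x0, y0, x1, y1] in [0, 1] "
                              f"for each object in this prompt: {prompt}"}])
    return resp.choices[0].message.content
```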