Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Authors: Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. |
| Researcher Affiliation | Collaboration | ¹Peking University, China; ²Stanford University, USA; ³Pika Labs, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/YangLing0818/RPG-DiffusionMaster. |
| Open Datasets | Yes | For quantitative results, we assess the text-image alignment of our method in a comprehensive benchmark, T2I-Compbench (Huang et al., 2023a). |
| Dataset Splits | No | The paper evaluates on T2I-Compbench but does not specify the train/validation/test dataset splits used for its experiments. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used to run its experiments. |
| Software Dependencies | No | The paper mentions using GPT-4 and SDXL, but does not provide specific version numbers for these or any other key software components, libraries, or frameworks. |
| Experiment Setup | Yes | The base prompt and its weighting hyperparameter, the base ratio, are critical in our regional diffusion; we provide further analysis in Figure 16. When the user prompt includes entities of the same class (e.g., two women, four boys), we need to set a higher base ratio to highlight these distinct identities. Conversely, when the user prompt includes entities with different class names (e.g., ceramic vase and glass vase), we need a lower base ratio to avoid confusion between the base prompt and subprompts. |
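
The base-ratio behavior described in the Experiment Setup row can be made concrete with a small sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' released implementation: `fuse_regional_latents`, the side-by-side box layout, and the example ratio values are all assumptions. It only shows how a base-prompt latent might be blended with per-region subprompt latents under a `base_ratio` weight.

```python
import numpy as np

def fuse_regional_latents(base_latent, region_latents, regions, base_ratio):
    """Hypothetical sketch of base-ratio weighting in regional diffusion.

    base_latent:    (C, H, W) latent denoised with the full base prompt.
    region_latents: one (C, H, W) latent per subprompt.
    regions:        (top, left, height, width) boxes tiling the canvas.
    base_ratio:     weight on the base prompt; higher preserves a shared
                    global identity (same-class entities), lower keeps
                    the subprompts distinct (different-class entities).
    """
    fused = np.zeros_like(base_latent)
    for latent, (top, left, h, w) in zip(region_latents, regions):
        # Paste each subprompt's latent into its assigned spatial region.
        fused[:, top:top + h, left:left + w] = latent[:, top:top + h, left:left + w]
    # Weighted blend: the base prompt contributes globally, subprompts regionally.
    return base_ratio * base_latent + (1.0 - base_ratio) * fused

# Two side-by-side regions on a 4x64x64 latent canvas.
C, H, W = 4, 64, 64
base = np.random.randn(C, H, W)
subs = [np.random.randn(C, H, W), np.random.randn(C, H, W)]
boxes = [(0, 0, H, W // 2), (0, W // 2, H, W // 2)]

same_class = fuse_regional_latents(base, subs, boxes, base_ratio=0.5)  # e.g., "two women"
diff_class = fuse_regional_latents(base, subs, boxes, base_ratio=0.2)  # e.g., "ceramic vase and glass vase"
```

Following the quoted guidance, the higher ratio lets the base prompt dominate so repeated same-class entities stay coherent, while the lower ratio lets each subprompt's distinct description prevail in its region.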