Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Authors: Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. |
| Researcher Affiliation | Collaboration | ¹Peking University, China; ²Stanford University, USA; ³Pika Labs, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/YangLing0818/RPG-DiffusionMaster. |
| Open Datasets | Yes | For quantitative results, we assess the text-image alignment of our method in a comprehensive benchmark, T2I-Compbench (Huang et al., 2023a). |
| Dataset Splits | No | The paper evaluates on T2I-Compbench but does not specify the train/validation/test dataset splits used for its experiments. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used to run its experiments. |
| Software Dependencies | No | The paper mentions using GPT-4 and SDXL, but does not provide specific version numbers for these or any other key software components, libraries, or frameworks. |
| Experiment Setup | Yes | The base prompt and its weighting hyperparameter, the base ratio, are critical in our regional diffusion; we provide further analysis in Figure 16. When the user prompt includes entities of the same class (e.g., two women, four boys), we need to set a higher base ratio to highlight these distinct identities. Conversely, when the user prompt includes entities with different class names (e.g., ceramic vase and glass vase), we need a lower base ratio to avoid confusion between the base prompt and subprompts. |
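
The base-ratio behavior described in the Experiment Setup row can be made concrete with a small sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' released implementation: `fuse_regional_latents`, the side-by-side box layout, and the example ratio values are all assumptions. It only shows how a base-prompt latent might be blended with per-region subprompt latents under a `base_ratio` weight.

```python
import numpy as np

def fuse_regional_latents(base_latent, region_latents, regions, base_ratio):
    """Hypothetical sketch of base-ratio weighting in regional diffusion.

    base_latent:    (C, H, W) latent denoised with the full base prompt.
    region_latents: one (C, H, W) latent per subprompt.
    regions:        (top, left, height, width) boxes tiling the canvas.
    base_ratio:     weight on the base prompt; higher preserves a shared
                    global identity (same-class entities), lower keeps
                    the subprompts distinct (different-class entities).
    """
    fused = np.zeros_like(base_latent)
    for latent, (top, left, h, w) in zip(region_latents, regions):
        # Paste each subprompt's latent into its assigned spatial region.
        fused[:, top:top + h, left:left + w] = latent[:, top:top + h, left:left + w]
    # Weighted blend: the base prompt contributes globally, subprompts regionally.
    return base_ratio * base_latent + (1.0 - base_ratio) * fused

# Two side-by-side regions on a 4x64x64 latent canvas.
C, H, W = 4, 64, 64
base = np.random.randn(C, H, W)
subs = [np.random.randn(C, H, W), np.random.randn(C, H, W)]
boxes = [(0, 0, H, W // 2), (0, W // 2, H, W // 2)]

same_class = fuse_regional_latents(base, subs, boxes, base_ratio=0.5)  # e.g., "two women"
diff_class = fuse_regional_latents(base, subs, boxes, base_ratio=0.2)  # e.g., "ceramic vase and glass vase"
```

Following the quoted guidance, the higher ratio lets the base prompt dominate so repeated same-class entities stay coherent, while the lower ratio lets each subprompt's distinct description prevail in its region.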