LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Authors: Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka

ICLR 2024

Reproducibility assessment: each variable below lists the result, followed by the supporting LLM response (quoted from the paper where applicable).
Research Type: Experimental
LLM Response: "Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. Our results showcase a significant improvement in recall (~85%) compared to the baseline Feng et al. (2023) (~69%), a +16% improvement. We also include a user study that demonstrates that our proposed method consistently produces coherent images that closely align with their respective textual descriptions, whereas existing approaches struggle to effectively handle longer text prompts." (A sketch of one way to compute such a recall metric appears after the table.)
Researcher Affiliation: Academia
LLM Response: "Hanan Gani¹, Shariq Farooq Bhat², Muzammal Naseer¹, Salman Khan¹,³, Peter Wonka²; ¹Mohamed Bin Zayed University of AI, ²KAUST, ³Australian National University; {hanan.ghani, muzammal.naseer, salman.khan}@mbzuai.ac.ae, shariq.bhat@kaust.edu.sa, pwonka@gmail.com"
Pseudocode: Yes
LLM Response: "We provide a pseudo code of our algorithm in Algorithm 1." (Algorithm 1: LLM Blueprint)
Open Source Code: Yes
LLM Response: "Our code is available at https://github.com/hananshafi/llmblueprint."
Open Datasets: Yes
LLM Response: "For acquiring the long text descriptions, we ask ChatGPT to generate scenes on various themes. In addition to this, we also use the textual descriptions from some COCO (Lin et al., 2014) and PASCAL (Everingham et al., 2010) images by querying an image captioning model (Zhu et al., 2023) to generate a detailed description spanning 80-100 words." (A hedged sketch of this collection step appears after the table.)
Dataset Splits: No
LLM Response: The paper uses datasets like COCO and PASCAL but does not provide specific numerical training/validation/test splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification: Yes
LLM Response: "Finally, our entire pipeline runs on a single Nvidia A100 40GB GPU."
Software Dependencies: Yes
LLM Response: "For implementation, we use PyTorch 2.0."
Experiment Setup: Yes
LLM Response: "We use 20 diffusion steps at this point. For box refinement, we use the pre-trained image composition model of Yang et al. (2023a) which conditions on a reference image. For each box refinement, we use 50 diffusion steps." From Algorithm 1: "Input: Long textual description C, diffusion steps k, sampling iterations n". (A pipeline sketch tying these settings together appears after the table.)
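
The headline result above is object recall on multi-object prompts. The snippet below illustrates one plausible form of such a metric: the fraction of objects mentioned in a prompt that an object detector actually finds in the generated image. This is an assumption about the metric's shape, not the paper's released evaluation code.

```python
# Minimal sketch (an assumption, not the paper's evaluation code) of an
# object-recall metric: of the objects a prompt requests, how many does a
# detector find in the generated image?
from typing import Set

def object_recall(prompt_objects: Set[str], detected_objects: Set[str]) -> float:
    """Fraction of prompt-mentioned objects recovered in the image."""
    if not prompt_objects:
        return 1.0  # vacuous prompt: nothing to miss
    return len(prompt_objects & detected_objects) / len(prompt_objects)

# Example: 3 of 4 requested objects were detected -> recall 0.75
print(object_recall({"dog", "sofa", "lamp", "window"}, {"dog", "sofa", "lamp"}))
```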
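The Open Datasets row says the long prompts were obtained by asking ChatGPT to generate themed scenes (plus captioning COCO/PASCAL images with the model of Zhu et al., 2023). Below is a hedged sketch of the ChatGPT half using the OpenAI Python client; the model name, prompt wording, and client usage are assumptions, since the paper does not publish this query code.

```python
# Hedged sketch of collecting themed 80-100 word scene descriptions via ChatGPT.
# The model name and prompt text are assumptions; only the general procedure
# ("ask ChatGPT to generate scenes on various themes") comes from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def themed_scene_description(theme: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; the paper just says "ChatGPT"
        messages=[{
            "role": "user",
            "content": f"Describe a detailed scene on the theme '{theme}' "
                       "in 80-100 words, naming every object and its placement.",
        }],
    )
    return resp.choices[0].message.content
```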
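Finally, the Pseudocode and Experiment Setup rows together outline the shape of Algorithm 1: an LLM turns the long description C into a scene blueprint, a layout-conditioned diffusion model renders the global scene with 20 steps, and each object box is then refined with the reference-conditioned composition model of Yang et al. (2023a) using 50 steps. The following is a minimal runnable sketch of that flow, not the authors' implementation (see https://github.com/hananshafi/llmblueprint for that); every function body is a placeholder, and interpreting the sampling iterations n as refinement passes is an assumption.

```python
# Minimal sketch of the LLM Blueprint pipeline as quoted above; strings stand
# in for images and all function bodies are placeholders, NOT the real code.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized

@dataclass
class ObjectSpec:
    name: str         # object class, e.g. "table"
    box: Box          # placement proposed by the LLM
    description: str  # detailed per-object prompt

def generate_blueprint(long_prompt: str) -> Tuple[List[ObjectSpec], str]:
    # Placeholder: the paper queries an LLM for object boxes, per-object
    # prompts, and a background prompt.
    objects = [ObjectSpec("table", (0.1, 0.5, 0.6, 0.9), "a rustic wooden table")]
    return objects, "a cozy cabin interior"

def layout_to_image(objects: List[ObjectSpec], background: str, steps: int) -> str:
    # Placeholder for the layout-conditioned diffusion stage (20 steps).
    return f"<scene: {background}, {len(objects)} objects, {steps} steps>"

def refine_box(image: str, obj: ObjectSpec, steps: int) -> str:
    # Placeholder for the reference-conditioned composition model of
    # Yang et al. (2023a); the paper uses 50 steps per box.
    return image + f" [refined '{obj.name}' in {steps} steps]"

def llm_blueprint(long_description: str, k: int = 20, n: int = 1) -> str:
    """Mirrors Algorithm 1's inputs: description C, diffusion steps k,
    sampling iterations n (interpreted here as refinement passes)."""
    objects, background = generate_blueprint(long_description)
    image = layout_to_image(objects, background, steps=k)
    for _ in range(n):
        for obj in objects:
            image = refine_box(image, obj, steps=50)
    return image

print(llm_blueprint("A cozy cabin interior with a rustic wooden table ..."))
```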