LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts
Authors: Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. Our results showcase a significant improvement in recall (∼85%) compared to the baseline Feng et al. (2023) (∼69%), a +16% improvement. We also include a user study demonstrating that our proposed method consistently produces coherent images that closely align with their respective textual descriptions, whereas existing approaches struggle to effectively handle longer text prompts. |
| Researcher Affiliation | Academia | Hanan Gani1, Shariq Farooq Bhat2, Muzammal Naseer1, Salman Khan1,3, Peter Wonka2 1Mohamed Bin Zayed University of AI 2KAUST 3Australian National University {hanan.ghani, muzammal.naseer, salman.khan}@mbzuai.ac.ae shariq.bhat@kaust.edu.sa, pwonka@gmail.com |
| Pseudocode | Yes | We provide pseudocode of our algorithm in Algorithm 1 ("Algorithm 1: LLM Blueprint"); a hedged sketch of this pipeline appears after the table. |
| Open Source Code | Yes | Our code is available at https://github.com/hananshafi/llmblueprint. |
| Open Datasets | Yes | For acquiring the long text descriptions, we ask ChatGPT to generate scenes on various themes. In addition to this, we also use the textual descriptions from some COCO (Lin et al., 2014) and PASCAL (Everingham et al., 2010) images by querying an image captioning model (Zhu et al., 2023) to generate a detailed description spanning 80-100 words; a sketch of this captioning query appears after the table. |
| Dataset Splits | No | The paper uses datasets like COCO and PASCAL but does not provide specific numerical training/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | Yes | Finally, our entire pipeline runs on a single Nvidia A100 40GB GPU. |
| Software Dependencies | Yes | For implementation, we use PyTorch 2.0. |
| Experiment Setup | Yes | We use 20 diffusion steps at this point. For box refinement, we use the pre-trained image composition model of Yang et al. (2023a), which conditions on a reference image. For each box refinement, we use 50 diffusion steps. Algorithm 1 inputs: long textual description C, diffusion steps k, sampling iterations n (these settings also appear in the pipeline sketch below). |
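
The following is a minimal, hedged sketch of Algorithm 1 (LLM Blueprint) assembled from the quoted excerpts above. It assumes three hypothetical placeholders not named in the paper: `query_llm` (the LLM blueprint-extraction call), `layout_to_image` (a layout-to-image diffusion model), and `compose_box` (standing in for the reference-conditioned composition model of Yang et al. (2023a)). The actual implementation is at https://github.com/hananshafi/llmblueprint.

```python
def query_llm(prompt: str) -> dict:
    """Placeholder: ask an LLM to turn a long description into a blueprint."""
    raise NotImplementedError

def layout_to_image(boxes: list, background: str, steps: int):
    """Placeholder: layout-to-image diffusion over the box layout."""
    raise NotImplementedError

def compose_box(image, box, description: str, steps: int):
    """Placeholder: composition model conditioned on a reference image."""
    raise NotImplementedError

def llm_blueprint(C: str, k: int = 20, n: int = 1):
    """Inputs mirror Algorithm 1: long textual description C,
    diffusion steps k, sampling iterations n."""
    # 1. Blueprint extraction: bounding boxes, per-object descriptions,
    #    and a background prompt generated by the LLM.
    bp = query_llm(
        "Extract object bounding boxes, object descriptions, and a "
        f"background prompt from this scene description: {C}"
    )

    # 2. Global scene generation (the paper reports 20 diffusion steps here).
    image = layout_to_image(bp["boxes"], bp["background"], steps=k)

    # 3. Iterative box refinement (50 diffusion steps per box in the paper).
    for _ in range(n):
        for box, desc in zip(bp["boxes"], bp["objects"]):
            image = compose_box(image, box, desc, steps=50)

    return image
```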
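For the dataset-acquisition step, here is a hedged sketch of querying a captioning model for a detailed 80-100 word description, as described in the Open Datasets row. `caption_model` and the prompt wording are illustrative assumptions, not taken from the paper; the paper uses the model of Zhu et al. (2023).

```python
def caption_model(image_path: str, prompt: str) -> str:
    """Placeholder for the image captioning model queried in the paper."""
    raise NotImplementedError

def detailed_description(image_path: str) -> str:
    # Illustrative prompt: the paper only states that descriptions
    # span 80-100 words, not the exact query used.
    prompt = ("Describe this image in detail, covering every object, its "
              "attributes, and the spatial relations between objects, "
              "in 80-100 words.")
    return caption_model(image_path, prompt)
```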