A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis
Authors: Nailei Hei, Qianyu Guo, Zihao Wang, Yan Wang, Haofen Wang, Wenqiang Zhang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics. Data and code are available at https://github.com/Naylenv/UF-FGTG. |
| Researcher Affiliation | Academia | 1Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University 2Engineering Research Center of Robotics, Ministry of Education, Academy for Engineering & Technology, Fudan University 3Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University 4Tongji University 5College of Design and Innovation, Tongji University |
| Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or explicitly labeled algorithm blocks. |
| Open Source Code | Yes | Data and code are available at https://github.com/Naylenv/UF-FGTG. |
| Open Datasets | Yes | Data and code are available at https://github.com/Naylenv/UF-FGTG. Specifically, we employ DiffusionDB (Wang et al. 2022), a large-scale dataset frequently employed for training in text-to-image tasks, as our analysis dataset. We build a coarse-fine granularity prompts dataset based on Lexica.art (Santana 2022), which consists of 81,910 fine-grained prompts filtered and extracted from user communities. |
| Dataset Splits | No | We split 73,718 data pairs as the training set and 8,192 data instances as the testing set. The paper does not explicitly mention a validation set split. |
| Hardware Specification | Yes | We conduct our experiments on NVIDIA A100 GPUs. |
| Software Dependencies | Yes | We utilize the CLIP model as the fine-grained text encoder and the T5 model (Raffel et al. 2020) as the text decoder to articulate our methodology... The fine-grained text encoder is initialized using Open CLIP (Cherti et al. 2023) derived from the Stable Diffusion model (Rombach et al. 2022). The text decoder DT is initialized with a FLAN-T5 (Chung et al. 2022) pretrained generative language model. A sketch of this model initialization follows the table. |
| Experiment Setup | Yes | The parameters used for image generation include step, seed, height, width, CFG scale, and sampler... During training, we train the fine-grained text encoder, domain adapter, and adaptive feature extraction module on our CFP dataset for 100 epochs, using the AdamW optimizer (Loshchilov and Hutter 2018), a learning rate of 5e-5, and a batch size of 16... For the image generation phase, we utilize Stable Diffusion v2.1, setting the CFG scale to 7, and perform 50 denoising steps using the Euler Ancestral sampler (Karras et al. 2022). A sketch of these generation settings follows the table. |
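
The Software Dependencies row names an OpenCLIP-initialized fine-grained text encoder (taken from Stable Diffusion) and a FLAN-T5 text decoder. Below is a minimal loading sketch, assuming Hugging Face `transformers` and the public `stabilityai/stable-diffusion-2-1` and `google/flan-t5-base` checkpoints; the paper's exact checkpoints and its adapter modules are not shown in the excerpt, so this is illustrative rather than the authors' code.

```python
# Minimal sketch: load the two pretrained text models named in the paper.
# Checkpoint IDs are assumptions, not confirmed by the excerpt.
from transformers import AutoTokenizer, CLIPTextModel, CLIPTokenizer, T5ForConditionalGeneration

# OpenCLIP-based text encoder as packaged with Stable Diffusion v2.1,
# standing in for the fine-grained text encoder initialization.
clip_tokenizer = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="tokenizer"
)
fine_grained_text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="text_encoder"
)

# FLAN-T5 pretrained generative language model used as the text decoder.
t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
text_decoder = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
```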
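
The Experiment Setup row reports image generation with Stable Diffusion v2.1, a CFG scale of 7, 50 denoising steps, and the Euler Ancestral sampler. The sketch below reproduces those settings with the `diffusers` library; the model ID, 768x768 resolution, seed, and prompt are illustrative assumptions not stated in the excerpt.

```python
# Minimal generation sketch matching the reported settings:
# Stable Diffusion v2.1, Euler Ancestral sampler, CFG scale 7, 50 steps.
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
# Swap in the Euler Ancestral sampler reported in the paper.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="a castle on a hill, intricate, concept art",  # hypothetical prompt
    num_inference_steps=50,   # 50 denoising steps
    guidance_scale=7.0,       # CFG scale 7
    height=768, width=768,    # assumed; the paper's height/width values are not given here
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for reproducibility
).images[0]
image.save("sample.png")
```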