A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis

Authors: Nailei Hei, Qianyu Guo, Zihao Wang, Yan Wang, Haofen Wang, Wenqiang Zhang

AAAI 2024

Reproducibility Variable — Result — LLM Response
Research Type — Experimental. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics. Data and code are available at https://github.com/Naylenv/UF-FGTG.
Researcher Affiliation — Academia. 1. Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University; 2. Engineering Research Center of Robotics, Ministry of Education, Academy for Engineering & Technology, Fudan University; 3. Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University; 4. Tongji University; 5. College of Design and Innovation, Tongji University.
Pseudocode — No. The paper describes methods and processes but does not include structured pseudocode or explicitly labeled algorithm blocks.
Open Source Code — Yes. Data and code are available at https://github.com/Naylenv/UF-FGTG.
Open Datasets — Yes. Data and code are available at https://github.com/Naylenv/UF-FGTG. Specifically, we employ DiffusionDB (Wang et al. 2022), a large-scale dataset frequently employed for training in text-to-image tasks, as our analysis dataset. We build a coarse-fine granularity prompts dataset based on Lexica.art (Santana 2022), which consists of 81,910 fine-grained prompts filtered and extracted from user communities.
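As a rough illustration of how the DiffusionDB analysis dataset can be accessed, the sketch below uses the Hugging Face datasets library. The repository id "poloclub/diffusiondb" and the subset name "2m_random_1k" are assumptions about the public release, not details stated in the paper.

    # Minimal sketch (not the authors' code): pull a small DiffusionDB sample.
    # Repository id and subset name are assumptions about the public release.
    from datasets import load_dataset

    diffusiondb = load_dataset("poloclub/diffusiondb", "2m_random_1k", split="train")
    print(diffusiondb[0]["prompt"])  # each record carries the user-written prompt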
Dataset Splits — No. We split 73,718 data pairs as the training set and 8,192 data instances as the testing set. The paper does not explicitly mention a validation set split.
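The two reported splits together account for the 81,910 prompt pairs in the CFP dataset (73,718 + 8,192 = 81,910). The paper does not state how pairs were assigned to splits, so the random shuffle and fixed seed in the sketch below are assumptions for illustration only.

    # Hypothetical sketch of reproducing the reported 73,718 / 8,192 split.
    # The shuffling strategy and seed are assumptions, not from the paper.
    import random

    def split_prompts(pairs, n_train=73_718, n_test=8_192, seed=0):
        """Shuffle coarse-fine prompt pairs and cut them into train/test sets."""
        assert len(pairs) >= n_train + n_test
        rng = random.Random(seed)
        shuffled = pairs[:]
        rng.shuffle(shuffled)
        return shuffled[:n_train], shuffled[n_train:n_train + n_test]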
Hardware Specification — Yes. We conduct our experiments on NVIDIA A100 GPUs.
Software Dependencies — Yes. We utilize the CLIP model as the fine-grained text encoder and the T5 model (Raffel et al. 2020) as the text decoder to articulate our methodology... The fine-grained text encoder is initialized using OpenCLIP (Cherti et al. 2023) derived from the Stable Diffusion model (Rombach et al. 2022). The text decoder DT is initialized with a FLAN-T5 (Chung et al. 2022) pretrained generative language model.
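A minimal sketch of this dependency stack is given below, assuming the open_clip and transformers libraries. The concrete checkpoints ("ViT-H-14"/"laion2b_s32b_b79k" for OpenCLIP, matching the Stable Diffusion 2.x text encoder, and "google/flan-t5-base" for the decoder) are assumptions; the paper only names OpenCLIP and FLAN-T5.

    # Sketch of the reported components, not the authors' implementation.
    # Checkpoint names are assumptions beyond what the paper specifies.
    import open_clip
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Fine-grained text encoder initialized from an OpenCLIP checkpoint.
    clip_model, _, _ = open_clip.create_model_and_transforms(
        "ViT-H-14", pretrained="laion2b_s32b_b79k"
    )
    clip_tokenizer = open_clip.get_tokenizer("ViT-H-14")

    # Text decoder D_T initialized from a pretrained FLAN-T5 model.
    t5_tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    text_decoder = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")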
Experiment Setup — Yes. The parameters used for image generation include step, seed, height, width, CFG scale, and sampler... During training, we train the fine-grained text encoder, domain adapter, and adaptive feature extraction module on our CFP dataset for 100 epochs, using the AdamW optimizer (Loshchilov and Hutter 2018), a learning rate of 5e-5, and a batch size of 16... For the image generation phase, we utilize Stable Diffusion v2.1, setting the CFG scale to 7, and perform 50 denoising steps using the Euler Ancestral sampler (Karras et al. 2022).
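The quoted generation-phase settings can be expressed as a short configuration sketch with the Hugging Face diffusers library. The CFG scale, step count, and sampler follow the quoted text; the model id "stabilityai/stable-diffusion-2-1", the placeholder prompt, and the seed value are assumptions.

    # Illustration of the stated generation configuration, not the authors' pipeline.
    # Training-side optimizer from the quoted text would be, analogously:
    #   optimizer = torch.optim.AdamW(trainable_params, lr=5e-5)
    import torch
    from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")
    # Swap in the Euler Ancestral sampler reported in the paper.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

    generator = torch.Generator("cuda").manual_seed(42)  # seed value is illustrative
    image = pipe(
        "a model-preferred prompt produced by UF-FGTG",  # placeholder prompt
        guidance_scale=7.0,           # CFG scale of 7 reported in the paper
        num_inference_steps=50,       # 50 denoising steps reported in the paper
        generator=generator,
    ).images[0]
    image.save("sample.png")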