Progressive Text-to-Image Diffusion with Soft Latent Direction
Authors: YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment Baselines and Evaluation. Our experimental comparison primarily concentrates on Single-Stage Generation and Progressive Generation baselines. (1) We refer to Single-Stage Generation methods as those that directly generate images from input text in a single step. Current methods include Stable Diffusion (Rombach et al. 2022), Attend-and-Excite (Chefer et al. 2023), and Structured Diffusion (Feng et al. 2022). We compare these methods to analyze the efficacy of our progressive synthesis operation. We employ GPT to construct 500 text prompts that contain diverse objects and relationship types. For evaluation, we follow (Wu et al. 2023) to compute Object Recall, which quantifies the percentage of objects successfully synthesized. Moreover, we measure Relation Accuracy as the percentage of spatial or relational text descriptions that are correctly identified, based on 8 human evaluations. (2) We define Progressive Generation as a multi-turn synthesis and editing process that builds on images from preceding rounds. Our comparison encompasses our comprehensive progressive framework against other progressive methods, which include Instruct-based Diffusion models (Brooks, Holynski, and Efros 2023) and mask-based diffusion models (Rombach et al. 2022; Avrahami, Fried, and Lischinski 2022). To maintain a balanced comparison, we source the same input images from SUN (Xiao et al. 2016) and text descriptions via the GPT API (OpenAI 2023). Specifically, we collate five scenarios totaling 25 images from SUN, a dataset that showcases real-world landscapes. Each image is paired with a text description, which ensures: 1. Integration of synthesis, editing, and erasing paradigms; 2. Incorporation of a diverse assortment of synthesized objects; 3. Representation of spatial relations (e.g., top, bottom, left, right) and interactional relations (e.g., playing with, wearing). For evaluation, we utilize Amazon Mechanical Turk (AMT) to assess image fidelity. Each image is evaluated based on the fidelity of the generated objects, their relationships, the execution of editing instructions, and the alignment of erasures with the text descriptions. Images are rated on a fidelity scale from 0 to 2, where 0 represents the lowest quality and 2 signifies the highest. With two evaluators assessing each generated image, the cumulative score for each aspect can reach a maximum of 100. Implementation Details. Our framework builds upon Stable Diffusion (SD) V-1.4. During the Stimulus & Response stage, we assign a weight of δ = 0.8 in Eq. (1), and set t = 25 and αt = 40 in Eq. (2). We implement the stimulus procedure over the 16×16 attention units and integrate the Iterative Latent Refinement design (Chefer et al. 2023). In the latent fusion stage, the parameter τ is set to a value of 40. Qualitative and Quantitative Results. Qualitative and Quantitative Comparisons with Single Generation Baselines. Fig. 6 reveals that traditional baseline methods often struggle with object omissions and with maintaining spatial and interactional relations. In contrast, our progressive generation process offers enhanced image fidelity and controllability. Additionally, we maintain finer details in the generated images, such as the shadows of the beach chair. Results in Table 1 indicate that our method outperforms the baselines in both object recall and relation accuracy. Qualitative and Quantitative Comparisons with Progressive Generation Baselines. In Fig. 8, baseline methods often fail to synthesize full objects and may not represent relationships as described in the provided text. Moreover, during editing and erasing operations, these methods tend to produce outputs with compromised quality, showcasing unnatural characteristics. It's worth noting that any missteps or inaccuracies in the initial stages, such as those seen in InstructPix2Pix, can cascade into subsequent stages, exacerbating the degradation of results. In contrast, our proposed method consistently yields superior results through every phase. The results in Table 2 further cement our method's dominant performance in synthesis, editing, and erasing operations, as underscored by the impressive rating scores. Ablation Study. The ablation study of method components is shown in Table 3. |
| Researcher Affiliation | Academia | Yuteng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang* Huazhong University of Science and Technology, Wuhan, China {yuteng ye, jaile cai, henrryzh, juggle lee, youjiazhang, skyesong, cxg, yjqing, weiyangcs}@hust.edu.cn |
| Pseudocode | No | The paper describes its methods in text and through figures but does not include formal pseudocode blocks or sections explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | We source the same input images from SUN (Xiao et al. 2016) and text descriptions via the GPT API (Open AI 2023). |
| Dataset Splits | No | The paper mentions collecting 500 text prompts for evaluation and sourcing images from the SUN dataset, but it does not provide specific percentages or counts for training, validation, and test splits needed to reproduce the experiment's data partitioning. |
| Hardware Specification | No | The paper states: 'The computation is completed in the HPC Platform of Huazhong University of Science and Technology.' This is a general statement about the computing environment and does not provide specific details such as GPU/CPU models, memory, or processor types. |
| Software Dependencies | Yes | Our framework builds upon Stable Diffusion (SD) V-1.4. |
| Experiment Setup | Yes | During the Stimulus & Response stage, we assign a weight of δ = 0.8 in Eq. (1), and set t = 25 and αt = 40 in Eq. (2). We implement the stimulus procedure over the 16×16 attention units and integrate the Iterative Latent Refinement design (Chefer et al. 2023). In the latent fusion stage, the parameter τ is set to a value of 40. |
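
The Object Recall metric quoted in the Research Type row (following Wu et al. 2023) amounts to counting how many prompt-mentioned objects can be found in each generated image. The sketch below illustrates that computation only; `detect_objects` is a hypothetical stand-in for whatever open-vocabulary detector the evaluation protocol uses, which the excerpt does not name.

```python
# Minimal sketch of the Object Recall metric described in the Research Type row.
# `detect_objects` is a hypothetical detector wrapper (not specified in the paper
# excerpt); it is assumed to return the set of object names visible in an image.
from typing import Callable, List, Set


def object_recall(prompt_objects: List[List[str]],
                  images: List[object],
                  detect_objects: Callable[[object], Set[str]]) -> float:
    """Percentage of prompt-mentioned objects successfully synthesized."""
    hit, total = 0, 0
    for objects, image in zip(prompt_objects, images):
        detected = detect_objects(image)  # hypothetical detector call
        total += len(objects)
        hit += sum(1 for name in objects if name in detected)
    return 100.0 * hit / max(total, 1)
```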
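
The quoted AMT protocol caps each aspect's cumulative score at 100, which is consistent with the stated setup of 25 images, two evaluators per image, and a 0–2 rating scale:

```latex
\underbrace{25}_{\text{images}} \times \underbrace{2}_{\text{evaluators}} \times \underbrace{2}_{\text{max rating}} = 100
```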
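
The Experiment Setup row collects the hyper-parameters needed to reproduce the pipeline. A compact way to restate them is a configuration object; only the values below come from the quoted Implementation Details, while the field names and the Hugging Face model identifier are assumptions of this sketch, not the authors' API.

```python
# Hyper-parameters reported in the Experiment Setup / Implementation Details rows,
# gathered into one illustrative config object (field names are my own).
from dataclasses import dataclass


@dataclass
class ProgressiveDiffusionConfig:
    base_model: str = "CompVis/stable-diffusion-v1-4"  # Stable Diffusion V-1.4 (assumed HF id)
    delta: float = 0.8                  # weight δ in Eq. (1), Stimulus & Response stage
    stimulus_t: int = 25                # t in Eq. (2)
    alpha_t: float = 40.0               # αt in Eq. (2)
    attention_resolution: int = 16      # stimulus applied over the 16×16 attention units
    iterative_latent_refinement: bool = True  # design from Chefer et al. (2023)
    fusion_tau: int = 40                # τ in the latent fusion stage


config = ProgressiveDiffusionConfig()
```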