Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

Authors: Mingyang Yi, Aoxue Li, Yi Xin, Zhenguo Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through an experiment (details are in Section 4), we find that in the early stage of the denoising process, the overall shapes of the generated images (latents) are already reconstructed. We both empirically show and theoretically explain that, in contrast to the high-frequency signals, the low-frequency signals of noisy data are not corrupted until the end stage of the forward noise-adding process. Finally, we apply our observations to one practical case: training-free sampling acceleration. Since the text prompt works in the first stage of the denoising process, we remove the textual-prompt-related model propagation (ϵθ(t, xt, C) in (3)) during the details-reconstruction stage, which barely changes the generated images but saves about 25%+ of the inference cost. We summarize our contributions as follows. 1. We show that, during the denoising process of the stable diffusion model, the overall shape and the details of generated images are reconstructed in the early and final stages, respectively. 2. For the working mechanism of the text prompt, we empirically show that the special token [EOS] dominates the influence of the text prompt in the early (overall-shape reconstruction) stage of the denoising process, during which the information from the text prompt is conveyed. Subsequently, the model fills in the details of the generated images mainly depending on the images themselves. 3. We apply our observations to accelerate the sampling of the denoising process by 25%+.
Researcher Affiliation | Collaboration | Mingyang Yi (1), Aoxue Li (2), Yi Xin (3), Zhenguo Li (2); (1) Renmin University of China, (2) Huawei Noah's Ark Lab, (3) Nanjing University
Pseudocode | No | The paper provides mathematical equations and descriptive steps for its processes but does not include structured pseudocode or algorithm blocks with explicit labels such as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The NeurIPS Paper Checklist, provided by the authors, states: 'We will release the code in future.'
Open Datasets | Yes | Concretely, we apply noise prediction (9) with varied a to generate 30K images from 30K text prompts in the test set of MS-COCO [20], for each sampler and backbone model. We consider backbone models Stable-Diffusion (SD) v1.5-Base, SD v2.1-Base [31], and Pixart-Alpha [5]. As in [14], we use 1600 prompts following the template "a {attribute} {noun}", with the attribute an adjective of color or texture. We create 800 text prompts under each of the two categories of attributes. Beyond that, we add an extra 1000 complex natural prompts from [14] without a predefined sentence template. These prompts constitute the text prompt set (abbreviated as the Prompt Set) used in our experiments.
Dataset Splits | No | The paper explicitly defines the 'Prompt Set' used for generating images and evaluates on the test set of MS-COCO. However, it does not specify a distinct training/validation/test split for MS-COCO in its own experiments: the authors use pre-trained models and evaluate on the test set, rather than performing a fresh training run that would require such splits.
Hardware Specification | Yes | Table 2: Saved Latency... evaluated on one V100 GPU. Acknowledgement: We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks), and the Ascend AI Processor used for this research.
Software Dependencies | No | The paper mentions specific models ('Stable Diffusion v1.5-Base', 'SD v2.1-Base', 'Pixart-Alpha') and samplers ('DDIM', 'DPM-Solver'), but does not provide version numbers for underlying software libraries, programming languages, or frameworks (e.g., PyTorch version, Python version).
Experiment Setup | Yes | Finally, unless otherwise specified, we use 50-step DDIM sampling [36]. Concretely, with a as the starting step of removing the text prompt, i.e., during t ∈ [0, a), we use w = 7.5, and w = 0 for t ∈ [a, 50], where a ∈ [0, 50]. To evaluate the computational cost saved by using noise prediction (9) during inference, as well as the quality of the generated data, we apply it to two standard samplers, DDIM [36] and DPM-Solver [22], on the benchmark MS-COCO [20] dataset for T2I generation. We consider backbone models Stable-Diffusion (SD) v1.5-Base, SD v2.1-Base [31], and Pixart-Alpha [5].
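The acceleration scheme quoted in the rows above (full classifier-free guidance with w = 7.5 while t < a, then w = 0 so the prompt-conditioned branch ϵθ(t, xt, C) is never evaluated) can be sketched as follows. This is an illustrative toy, not the authors' code: `eps_theta` is a stand-in for the real UNet, the update rule is not a real DDIM step, and the cutoff a = 25 is a hypothetical choice; only the call pattern and the saved-evaluation count are the point.

```python
def eps_theta(t, x, cond=None):
    # Toy stand-in for the noise prediction eps_theta(t, x_t, C);
    # the constants are arbitrary, only the call pattern matters.
    return 0.05 * x + (0.1 if cond is not None else 0.0)

def accelerated_ddim(x, num_steps=50, a=25, w=7.5, cond="a red car"):
    """Sketch of the training-free acceleration: classifier-free
    guidance with weight w during the shape-reconstruction stage
    (t < a), then w = 0, so the prompt-conditioned forward pass
    is skipped entirely during the details-reconstruction stage."""
    model_calls = 0
    for t in range(num_steps):
        eps_u = eps_theta(t, x)                # unconditional pass
        model_calls += 1
        if t < a:
            eps_c = eps_theta(t, x, cond)      # prompt-conditioned pass
            model_calls += 1
            eps = eps_u + w * (eps_c - eps_u)  # CFG combination
        else:
            eps = eps_u                        # w = 0: text branch removed
        x = x - 0.02 * eps                     # toy update, not real DDIM
    return x, model_calls
```

With 50 steps and a = 25, this makes 75 model evaluations instead of the 100 that standard classifier-free guidance would need (two per step), i.e., a 25% saving in model calls, consistent with the 25%+ figure reported in the table.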
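The Prompt Set construction described in the Open Datasets row (the template "a {attribute} {noun}" with color and texture adjectives) could be sketched as below. The word lists here are placeholders: the actual attribute and noun lists come from [14] and are not reproduced in this report.

```python
def build_prompt_set(color_attrs, texture_attrs, nouns):
    # Template from the paper: "a {attribute} {noun}", with the
    # attribute an adjective of color or texture. Returns one prompt
    # per (attribute, noun) pair, color attributes first.
    return [f"a {attr} {noun}"
            for attr in list(color_attrs) + list(texture_attrs)
            for noun in nouns]

# Placeholder vocabulary for illustration only:
prompts = build_prompt_set(["red"], ["wooden"], ["car", "chair"])
```

The paper builds 800 prompts per attribute category this way, then adds 1000 template-free complex prompts from [14] to form the full Prompt Set.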