Align Your Steps: Optimizing Sampling Schedules in Diffusion Models

Authors: Amirmojtaba Sabour, Sanja Fidler, Karsten Kreis

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our novel approach on several image, video as well as 2D toy data synthesis benchmarks, using a variety of different samplers, and observe that our optimized schedules outperform previous hand-crafted schedules in almost all experiments."
Researcher Affiliation | Collaboration | ¹NVIDIA, Toronto, Canada; ²Department of Computer Science, University of Toronto, Toronto, Ontario; ³Vector Institute, Toronto, Canada.
Pseudocode | Yes | Pseudocode is given in App. B.1.
Open Source Code | Yes | "We provide our optimized schedules for Stable Diffusion 1.5, SDXL, DeepFloyd IF, and Stable Video Diffusion in Table 3." The schedules have been made publicly available (see App. B.2), and the authors also provide a Colab notebook on their project page showing how to use these schedules in practice.
Open Datasets | Yes | "We evaluate our method on various datasets including 2D toy data, standard image datasets such as CIFAR10 (Krizhevsky et al., 2009), FFHQ (Karras et al., 2019), and ImageNet (Deng et al., 2009), large-scale text-to-image models widely used by practitioners such as Stable Diffusion (Rombach et al., 2021) and SDXL (Podell et al., 2023), as well as the recent video DM Stable Video Diffusion (Blattmann et al., 2023a)."
Dataset Splits | No | The paper mentions subsets used for optimization and evaluation (e.g., "a subset of 8192 data samples", and "We made use of the COCO (Lin et al., 2014) dataset to optimize the schedule for the text-to-image models. We used a subset of 10,000 images for this task, and excluded these images during FID evaluation."), but it does not specify explicit train/validation/test splits with percentages, sample counts, or citations to standard splits for all primary datasets used in the main experiments.
Hardware Specification | Yes | "In our experiments, we used RTX6000 GPUs to carry out the optimization. The FFHQ and CIFAR10 experiments required 4 GPUs for 1.5 hours. The ImageNet 256x256 and text-to-image experiments were done with 8 GPUs and took roughly 3-4 hours. Lastly, the Stable Video Diffusion experiments were done with 16 GPUs and took 6 hours."
Software Dependencies | No | The paper mentions various models (e.g., Stable Diffusion 1.5, SDXL), solvers (e.g., SDE-DPM-Solver++(2M)), and a Colab notebook, but it does not specify software versions (e.g., PyTorch, CUDA, Python) for reproducibility.
Experiment Setup | Yes | "We use 3 different classes of stochastic solvers: Stochastic DDIM (Song et al., 2020a), second-order SDE-DPM-Solver++ (Lu et al., 2022b), and the recently proposed 1st, 2nd, and 3rd order ER-SDE-Solvers (Cui et al., 2023). We also report FID scores for two popular deterministic solvers, namely DDIM (Song et al., 2020a) and DPM-Solver++(2M) (Lu et al., 2022b). For simplicity, no dynamic thresholding is used (Saharia et al., 2022). For continuous-time models, we initialize the schedule according to the EDM scheme, and for discrete-time models, time-uniform initialization is used."