Multistep Distillation of Diffusion Models via Moment Matching

Authors: Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the Imagenet dataset. We evaluate our proposed methods in the class-conditional generation setting on the ImageNet dataset.
Researcher Affiliation | Industry | Google DeepMind, Amsterdam
Pseudocode | Yes | Algorithm 1 Ancestral sampling algorithm used for both standard denoising diffusion models as well as our distilled models. ... Algorithm 2 Moment matching algorithm with alternating optimization of generator gη and auxiliary denoising model gφ. ... Algorithm 3 Parameter-space moment matching algorithm with instant denoising model gφ. (Hedged code sketches of these algorithms follow the table.)
Open Source Code | No | We are currently unable to share code but hope to be able to do so in the future.
Open Datasets | Yes | We evaluate our proposed methods in the class-conditional generation setting on the ImageNet dataset (Deng et al., 2009) ... In Table 3 we report zero-shot FID (Heusel et al., 2017) and CLIP Score (Radford et al., 2021) on MS-COCO (Lin et al., 2014).
Dataset Splits | Yes | We distill our models for a maximum of 200,000 steps at batch size 2048, calculating FID every 5,000 steps. We report the optimal FID seen during the distillation process, keeping evaluation data and random seeds fixed across evaluations to minimize bias.
Hardware Specification | Yes | All experiments were run on TPUv5e, using 256 chips per experiment.
Software Dependencies | No | We use the Adam optimizer (Kingma & Ba, 2014) with β1 = 0, β2 = 0.99, ϵ = 1e-12.
Experiment Setup | Yes | We distill our models for a maximum of 200,000 steps at batch size 2048... We use the Adam optimizer (Kingma & Ba, 2014) with β1 = 0, β2 = 0.99, ϵ = 1e-12. We use learning rate warmup for the first 1,000 steps and then linearly anneal the learning rate to zero over the remainder of the optimization steps. We use gradient clipping with a maximum norm of 1. (A hedged optimizer configuration sketch follows the table.)
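
The Pseudocode row lists Algorithm 1, the ancestral sampler shared by the teacher and the distilled models. Below is a minimal JAX sketch of such a sampler for an x-prediction model, assuming a variance-preserving cosine noise schedule; the denoiser, schedule, and step count are illustrative assumptions rather than the paper's exact specification.

```python
# A minimal sketch of ancestral sampling with an x-prediction denoiser,
# assuming a variance-preserving cosine noise schedule. `denoise_fn`, the
# schedule, and the step count are illustrative assumptions, not the
# paper's exact Algorithm 1.
import jax
import jax.numpy as jnp


def cosine_schedule(t):
    """Return (alpha_t, sigma_t) for an assumed VP cosine noise schedule."""
    return jnp.cos(0.5 * jnp.pi * t), jnp.sin(0.5 * jnp.pi * t)


def ancestral_sample(rng, denoise_fn, shape, num_steps=8):
    """Iterate the standard ancestral update z_t -> z_s from t=1 down to t=0.

    denoise_fn(z_t, t) is assumed to predict the clean sample x directly.
    """
    rng, init_rng = jax.random.split(rng)
    z = jax.random.normal(init_rng, shape)       # z_1 ~ N(0, I)
    ts = jnp.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, s = ts[i], ts[i + 1]
        alpha_t, sigma_t = cosine_schedule(t)
        alpha_s, sigma_s = cosine_schedule(s)
        x_hat = denoise_fn(z, t)                 # model's x-prediction
        # Posterior q(z_s | z_t, x_hat) for a variance-preserving diffusion.
        alpha_ts = alpha_t / alpha_s
        sigma2_ts = sigma_t**2 - alpha_ts**2 * sigma_s**2
        mean = (alpha_s * sigma2_ts / sigma_t**2) * x_hat + (
            alpha_ts * sigma_s**2 / sigma_t**2) * z
        var = jnp.maximum(sigma2_ts * sigma_s**2 / sigma_t**2, 0.0)
        rng, step_rng = jax.random.split(rng)
        z = mean + jnp.sqrt(var) * jax.random.normal(step_rng, shape)
    return z


# Example usage with a trivial placeholder denoiser.
sample = ancestral_sample(
    jax.random.PRNGKey(0), lambda z, t: jnp.zeros_like(z), (4, 32, 32, 3))
```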
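
Algorithms 2 and 3 describe moment matching with an auxiliary denoising model gφ trained alongside the generator gη. The toy sketch below only illustrates the alternating structure: fit gφ to samples from the current generator with a standard denoising loss, then nudge the generator using the difference between teacher and auxiliary x-predictions. The networks are placeholder linear maps, and the loss weighting and sign conventions are assumptions rather than the authors' exact formulation; the parameter-space variant (Algorithm 3) is not shown.

```python
# Toy JAX sketch of an alternating moment-matching-style distillation loop.
# All networks are placeholder linear maps; weighting and signs are assumed.
import jax
import jax.numpy as jnp

DIM = 16        # toy data dimensionality (assumed)
BATCH = 64      # toy batch size (assumed)


def teacher_denoise(z_t, t):
    """Stand-in for the frozen teacher's x-prediction."""
    return z_t / (1.0 + t)


def aux_denoise(phi, z_t, t):
    """Toy auxiliary denoiser g_phi: a linear map of the noisy input."""
    return z_t @ phi


def generator(eta, noise):
    """Toy one-step generator g_eta mapping noise to a sample."""
    return noise @ eta


def noisy(rng, x, t):
    """Diffuse x to time t under an assumed VP cosine schedule."""
    alpha, sigma = jnp.cos(0.5 * jnp.pi * t), jnp.sin(0.5 * jnp.pi * t)
    return alpha * x + sigma * jax.random.normal(rng, x.shape)


def aux_loss(phi, eta, rng, t):
    """Standard denoising loss for g_phi on generator samples."""
    rng_x, rng_z = jax.random.split(rng)
    x = generator(eta, jax.random.normal(rng_x, (BATCH, DIM)))
    z_t = noisy(rng_z, x, t)
    return jnp.mean((aux_denoise(phi, z_t, t) - x) ** 2)


def gen_loss(eta, phi, rng, t):
    """Move generated x along the (teacher - auxiliary) prediction gap."""
    rng_x, rng_z = jax.random.split(rng)
    x = generator(eta, jax.random.normal(rng_x, (BATCH, DIM)))
    z_t = noisy(rng_z, x, t)
    gap = teacher_denoise(z_t, t) - aux_denoise(phi, z_t, t)
    return -jnp.mean(x * jax.lax.stop_gradient(gap))


# Alternating optimization: a few auxiliary updates per generator update.
rng, eta, phi = jax.random.PRNGKey(0), jnp.eye(DIM), jnp.eye(DIM)
for step in range(100):
    t = jax.random.uniform(jax.random.fold_in(rng, step), minval=0.01)
    for _ in range(2):
        rng, sub = jax.random.split(rng)
        phi -= 1e-2 * jax.grad(aux_loss)(phi, eta, sub, t)
    rng, sub = jax.random.split(rng)
    eta -= 1e-2 * jax.grad(gen_loss)(eta, phi, sub, t)
```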
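
The Experiment Setup row reports the optimizer hyperparameters. One way to express them, using optax as an assumed library choice (the paper does not state its software stack), is sketched below; the peak learning rate is a placeholder since it is not given in the quoted text.

```python
# Sketch of the reported training hyperparameters expressed with optax
# (library choice is ours, not confirmed by the paper).
import optax

TOTAL_STEPS = 200_000    # "a maximum of 200,000 steps"
WARMUP_STEPS = 1_000     # "learning rate warmup for the first 1,000 steps"
PEAK_LR = 1e-4           # placeholder value, not reported in the quotes

# Linear warmup to the peak, then linear anneal to zero over the rest.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, WARMUP_STEPS),
        optax.linear_schedule(PEAK_LR, 0.0, TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),   # "gradient clipping with a maximum norm of 1"
    optax.adam(learning_rate=schedule, b1=0.0, b2=0.99, eps=1e-12),
)
```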