One-Step Diffusion Distillation through Score Implicit Matching

Authors: Weijian Luo, Zemin Huang, Zhengyang Geng, J. Zico Kolter, Guo-Jun Qi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of SIM compared to previous approaches, using different choices of distance functions to define the divergence. Most relatedly, we compare SIM with the Diff-Instruct (DI) [43] method, which uses a KL-based divergence term, and the Score Identity Distillation (SiD) method [93], which we show to be a special case of our approach when the distance function is simply chosen to be the squared L2 distance (though derived in an entirely different fashion). We also show empirically that SIM with a specially designed Pseudo-Huber distance function exhibits faster convergence and stronger robustness to hyperparameters than the L2 distance, making the resulting method substantially stronger than previous approaches. Finally, we show that SIM obtains very strong empirical performance in absolute terms relative to past work in the field on CIFAR10 image generation and text-to-image generation. On the CIFAR10 dataset, SIM achieves one-step generative performance with a Fréchet Inception Distance (FID) of 2.06 for unconditional generation and 1.96 for class-conditional generation. (A sketch of the two distance functions follows the table.)
Researcher Affiliation | Academia | Weijian Luo (Peking University, luoweijian@stu.pku.edu.cn); Zemin Huang (Westlake University, huangzemin@westlake.edu.cn); Zhengyang Geng (Carnegie Mellon University, zgeng2@cs.cmu.edu); J. Zico Kolter (Carnegie Mellon University, zkolter@cs.cmu.edu); Guo-Jun Qi (Westlake University, guojunq@gmail.com)
Pseudocode | Yes | Algorithm 1: Score Implicit Matching for Diffusion Distillation (pseudo-code in Appendix A.2). Input: pre-trained DM s_{q_t}(·), generator g_θ, prior distribution p_z, online DM s_ψ(·); differentiable distance function d(·), and forward diffusion (Eq. 2.1). (A schematic training-loop sketch follows the table.)
Open Source Code | No | Since our code is subject to a business policy, we cannot release it at this time, but we plan to release the code upon acceptance.
Open Datasets | Yes | On the CIFAR10 dataset, it achieves an FID of 2.06 for unconditional generation and 1.96 for class-conditional generation. ... We utilized the SAM-LLaVA-Caption10M dataset... We also compare the SIM-DiT-600M and PixArt-α with other few-step models... on the widely used COCO-2017 validation dataset.
Dataset Splits | Yes | On the SAM-LLaVA-Caption10M, which is one of the datasets the original PixArt-α model is trained on... We also compare the SIM-DiT-600M and PixArt-α with other few-step models... on the widely used COCO-2017 validation dataset.
Hardware Specification | Yes | Our best model is trained (data-freely) with 4 A100-80G GPUs for 2 days... All experiments in this section were conducted on 4 A100-40G GPUs with bfloat16 precision.
Software Dependencies | No | The paper mentions using 'PyTorch-style pseudo-code' and optimizers such as 'Adam', but it does not specify exact version numbers for these software components or any other libraries.
Experiment Setup | Yes | Table 5: Hyperparameters used for SIM on CIFAR10 EDM distillation (learning rate, batch size, Adam β0, Adam β1). ... All experiments in this section were conducted on 4 A100-40G GPUs with bfloat16 precision, using the PixArt-XL-2-512x512 model version and employing the same hyperparameters. For both optimizers, we utilized Adam with a learning rate of 5e-6 and betas=[0, 0.999]. Additionally, to enable a batch size of 1024, we employed gradient checkpointing and set the gradient accumulation to 8. Finally, regarding the training noise distribution... Our best model was trained on the SAM Caption dataset for approximately 16k iterations, which is equivalent to less than 2 epochs. (An optimizer and accumulation configuration sketch follows the table.)
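As referenced in the Research Type row, SIM is instantiated with different distance functions d(·): the squared-L2 choice recovers an SiD-style objective, while a Pseudo-Huber distance is reported to converge faster and be more robust to hyperparameters. The following is a minimal sketch of the two choices, assuming the common Pseudo-Huber form sqrt(||x||² + c²) − c; the constant c and the per-sample reduction are illustrative assumptions, not values taken from the paper.

```python
import torch


def squared_l2(x: torch.Tensor) -> torch.Tensor:
    """Squared L2 distance per sample; with this choice SIM reduces to an SiD-style objective."""
    return (x ** 2).flatten(start_dim=1).sum(dim=1)


def pseudo_huber(x: torch.Tensor, c: float = 0.03) -> torch.Tensor:
    """Pseudo-Huber distance per sample: sqrt(||x||^2 + c^2) - c.

    The constant c is a tunable hyperparameter; the value here is purely illustrative.
    """
    sq_norm = (x ** 2).flatten(start_dim=1).sum(dim=1)
    return torch.sqrt(sq_norm + c ** 2) - c
```

The Pseudo-Huber form behaves quadratically near zero but grows roughly linearly for large residuals, which is one plausible reading of the robustness reported in the paper.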
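The Pseudocode row lists the inputs of Algorithm 1 (pre-trained DM, generator, online DM, distance function, forward diffusion). Since the code is not released, the following is only a schematic alternating-update loop in the spirit of such score-distillation recipes: the forward diffusion, the score-matching target, and the generator loss shown here are simplified placeholders, not the paper's exact SIM gradient.

```python
import torch


def diffuse(x0: torch.Tensor, sigma_max: float = 80.0):
    """Illustrative EDM-style forward diffusion x_t = x_0 + sigma * eps (not the paper's exact schedule)."""
    sigma = torch.rand(x0.shape[0], device=x0.device).view(-1, *([1] * (x0.dim() - 1))) * sigma_max
    eps = torch.randn_like(x0)
    return x0 + sigma * eps, sigma, eps


def sim_style_step(generator, online_dm, teacher_dm, opt_g, opt_s, z, d):
    """One alternating update: (a) fit the online DM to the current generator's samples,
    (b) update the generator by comparing online and teacher scores under the distance d."""
    # (a) online DM update on samples from the frozen generator
    x0 = generator(z).detach()
    xt, sigma, eps = diffuse(x0)
    opt_s.zero_grad()
    # denoising-score-matching target for x_t = x_0 + sigma * eps is -eps / sigma
    loss_s = ((online_dm(xt, sigma) + eps / sigma) ** 2).mean()
    loss_s.backward()
    opt_s.step()

    # (b) generator update; the paper derives the actual gradient via its SIM theorem,
    #     so this plain score-difference surrogate is only a stand-in
    opt_g.zero_grad()
    xt, sigma, _ = diffuse(generator(z))
    loss_g = d(online_dm(xt, sigma) - teacher_dm(xt, sigma)).mean()
    loss_g.backward()
    opt_g.step()
    return loss_s.item(), loss_g.item()
```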
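The Experiment Setup row reports concrete optimizer and batching settings for the text-to-image runs (Adam, lr 5e-6, betas [0, 0.999], bfloat16, gradient checkpointing, batch size 1024 via gradient accumulation of 8 on 4 GPUs). Below is a minimal PyTorch sketch of that configuration; the placeholder modules and the per-GPU micro-batch arithmetic are assumptions for illustration, not details taken from the paper.

```python
import torch

# Placeholder modules standing in for the distilled generator and the online DM;
# the actual experiments fine-tune the PixArt-XL-2-512x512 checkpoint.
generator = torch.nn.Linear(8, 8)
online_dm = torch.nn.Linear(8, 8)

# Reported settings: Adam with lr = 5e-6 and betas = [0, 0.999] for both optimizers.
opt_g = torch.optim.Adam(generator.parameters(), lr=5e-6, betas=(0.0, 0.999))
opt_s = torch.optim.Adam(online_dm.parameters(), lr=5e-6, betas=(0.0, 0.999))

# Reported: effective batch size 1024 via gradient accumulation of 8 on 4 A100-40G GPUs.
grad_accum_steps = 8
num_gpus = 4
micro_batch_per_gpu = 1024 // (grad_accum_steps * num_gpus)  # = 32 (inferred, not stated in the paper)

# Reported: bfloat16 precision (gradient checkpointing is enabled inside the real model).
autocast = torch.autocast(device_type="cuda", dtype=torch.bfloat16)
```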