Phased Consistency Models

Authors: Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations demonstrate that PCMs outperform LCMs across 1–16-step generation settings. While PCMs are specifically designed for multi-step refinement, their 1-step generation results are comparable to those of prior state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that the PCM methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator.
Researcher Affiliation | Collaboration | 1 MMLab, CUHK; 2 Avolution AI; 3 Hedra; 4 Shanghai AI Lab; 5 SenseTime Research; 6 Stanford University; 7 CPII under InnoHK
Pseudocode | Yes | Appendix VI.2, Pseudo Training Code (a minimal training-step sketch follows this table)
Open Source Code | Yes | Our code is available at https://github.com/G-U-N/Phased-Consistency-Model.
Open Datasets | Yes | Dataset. Training dataset: for image generation, we train all models on the CC3M [5] dataset; for video generation, we train the model on WebVid-2M [2]. Evaluation dataset: for image generation, we evaluate performance on COCO-2014 [28] following the Karpathy 30K split, and on CC12M with our randomly chosen 30K split; for video generation, we evaluate with the captions of UCF-101 [58].
Dataset Splits | Yes | Evaluation dataset: for image generation, we evaluate performance on COCO-2014 [28] following the Karpathy 30K split, and on CC12M with our randomly chosen 30K split; for video generation, we evaluate with the captions of UCF-101 [58]. We report the FID [14] and CLIP score [43] of the generated images against the validation 30K-sample splits. (A hedged evaluation sketch follows this table.)
Hardware Specification | Yes | We achieve state-of-the-art few-step text-to-image and text-to-video generation with only 8 A800 GPUs, indicating the advancement of our method.
Software Dependencies | No | The paper mentions training on Stable Diffusion v1-5 and Stable Diffusion XL and using LoRA, but does not specify software versions for programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | Specifically, for multi-step models, we trained LoRA with a rank of 64. For models based on SD v1-5, we used a learning rate of 5e-6, a batch size of 160, and trained for 5k iterations. For models based on SDXL, we used a learning rate of 5e-6, a batch size of 80, and trained for 10k iterations. We did not use EMA for LoRA training. (A hedged configuration sketch follows this table.)
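
To make the appendix pseudocode concrete, here is a minimal sketch of one phased consistency distillation step. Everything model-specific is an assumption: a toy MLP stands in for the SD UNet, a one-step Euler update stands in for the DDIM ODE solver, and the noise schedule is illustrative. Only the structure — timesteps split into phases, with self-consistency enforced toward each phase's own boundary rather than toward t=0 — reflects the paper; this is not the authors' released training code.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

T, M = 1000, 4                               # diffusion horizon, number of phases
edges = torch.linspace(0, T, M + 1)          # phase boundaries: [0, 250, 500, 750, 1000]

def make_net():
    # Toy MLP denoiser standing in for the SD UNet: input is a 2-D
    # "latent" plus a normalized timestep, output is an x-prediction.
    return nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))

student = make_net()
teacher = make_net().requires_grad_(False)             # frozen pretrained diffusion model
target = copy.deepcopy(student).requires_grad_(False)  # EMA copy of the student
opt = torch.optim.AdamW(student.parameters(), lr=5e-6)

def f(net, x, t):
    # Toy consistency function; PCM additionally parameterizes f with
    # c_skip / c_out so that f(x, s) = x exactly at each phase boundary s.
    return net(torch.cat([x, t[:, None] / T], dim=1))

for step in range(100):
    x0 = torch.randn(160, 2)                  # stand-in for data latents
    phase = torch.randint(0, M, (160,))       # assign each sample to a phase
    lo, hi = edges[phase], edges[phase + 1]
    t = lo + (hi - lo) * torch.rand(160)      # sample t inside its own phase
    xt = x0 + (t / T)[:, None] * torch.randn_like(x0)  # toy noising

    # One frozen-teacher solver step toward the phase start (Euler stands
    # in for the DDIM ODE solver used in the paper).
    dt = (t - lo) / 10
    x_prev = xt - (dt / T)[:, None] * (xt - f(teacher, xt, t))
    t_prev = t - dt

    # Phased consistency loss: two points inside the same phase must map
    # to the same solution at that phase's start boundary.
    loss = nn.functional.mse_loss(f(student, xt, t),
                                  f(target, x_prev, t_prev).detach())
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                     # EMA update of the target network
        for p_t, p_s in zip(target.parameters(), student.parameters()):
            p_t.mul_(0.95).add_(p_s, alpha=0.05)
```

The paper also adds an adversarial consistency loss on top of this objective; the sketch keeps only the distillation core.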
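For the evaluation protocol, the sketch below shows one way to compute FID and CLIP score on a 30K split. The paper does not name its evaluation libraries, so torchmetrics is an assumption, as is the CLIP checkpoint; the real pipeline generates the images with the distilled model from the split's captions.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Both metrics expect uint8 image tensors of shape (N, 3, H, W) by default.
fid = FrechetInceptionDistance(feature=2048)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def evaluate(real_images, fake_images, captions):
    # real_images: reference images from the 30K evaluation split;
    # fake_images: images generated from that split's captions.
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    clip_score.update(fake_images, captions)
    return fid.compute().item(), clip_score.compute().item()

# Usage with the real splits (hypothetical variable names); FID statistics
# are only meaningful with thousands of images, e.g. the full 30K split:
# fid_value, clip_value = evaluate(real_30k, generated_30k, captions_30k)
```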
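Finally, a hedged configuration sketch wiring up the stated experiment setup with diffusers and peft. The rank (64), learning rate (5e-6), batch sizes, and iteration counts come from the paper; the checkpoint id, `target_modules`, and `lora_alpha` are assumptions, since the paper does not specify them.

```python
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Load the SD v1-5 UNet and freeze it so only LoRA weights are trained.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)

lora_config = LoraConfig(
    r=64,                          # LoRA rank from the paper
    lora_alpha=64,                 # assumption: alpha = rank
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed attention projections
)
unet.add_adapter(lora_config)      # diffusers' PEFT integration (recent versions)

lora_params = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=5e-6)  # lr from the paper

# Training loop (not shown): batch size 160 for 5k iterations on SD v1-5,
# batch size 80 for 10k iterations on SDXL; no EMA is kept for the LoRA.
```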