Phased Consistency Models
Authors: Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations demonstrate that PCMs outperform LCMs across 1–16 step generation settings. While PCMs are specifically designed for multi-step refinement, they achieve comparable 1-step generation results to previously state-of-the-art specifically designed 1-step methods. Furthermore, we show the methodology of PCMs is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. |
| Researcher Affiliation | Collaboration | ¹MMLab, CUHK; ²Avolution AI; ³Hedra; ⁴Shanghai AI Lab; ⁵SenseTime Research; ⁶Stanford University; ⁷CPII under InnoHK |
| Pseudocode | Yes | VI.2 Pseudo Training Code |
| Open Source Code | Yes | Our code is available at https://github.com/G-U-N/Phased-Consistency-Model. |
| Open Datasets | Yes | Dataset. Training dataset: For image generation, we train all models on the CC3M [5] dataset. For video generation, we train the model on WebVid-2M [2]. Evaluation dataset: For image generation, we evaluate the performance on COCO-2014 [28] following the Karpathy 30K split. We also evaluate the performance on CC12M with our randomly chosen 30K split. For video generation, we evaluate with the captions of UCF-101 [58]. |
| Dataset Splits | Yes | Evaluation dataset: For image generation, we evaluate the performance on COCO-2014 [28] following the Karpathy 30K split. We also evaluate the performance on CC12M with our randomly chosen 30K split. For video generation, we evaluate with the captions of UCF-101 [58]. We report the FID [14] and CLIP score [43] computed between the generated images and the validation 30K-sample splits (see the metric sketch below the table). |
| Hardware Specification | Yes | We achieve state-of-the-art few-step text-to-image generation and text-to-video generation with only 8 A800 GPUs, indicating the advancements of our method. |
| Software Dependencies | No | The paper mentions training on Stable Diffusion v1-5 and Stable Diffusion XL, and using LoRA, but does not specify software versions for programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | Specifically, for multi-step models, we trained LoRA with a rank of 64. For models based on SD v1-5, we used a learning rate of 5e-6, a batch size of 160, and trained for 5k iterations. For models based on SDXL, we used a learning rate of 5e-6, a batch size of 80, and trained for 10k iterations. We did not use EMA for LoRA training (see the configuration sketch below the table). |
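
The hyperparameters in the Experiment Setup row map onto a small training configuration. The sketch below is a hypothetical summary of those values, not the authors' released code; the class and field names are our own, and only the numbers come from the paper.

```python
from dataclasses import dataclass

@dataclass
class PCMLoRATrainingConfig:
    """Hypothetical container for the LoRA hyperparameters reported above."""
    base_model: str              # "sd-v1-5" or "sdxl"
    lora_rank: int = 64          # LoRA rank used for multi-step models
    learning_rate: float = 5e-6  # same for both base models
    batch_size: int = 160
    max_iterations: int = 5_000
    use_ema: bool = False        # the paper reports no EMA for LoRA training

# Reported settings for the two base models.
SD15_CONFIG = PCMLoRATrainingConfig(base_model="sd-v1-5", batch_size=160, max_iterations=5_000)
SDXL_CONFIG = PCMLoRATrainingConfig(base_model="sdxl", batch_size=80, max_iterations=10_000)
```

The authors' actual training procedure, including the pseudo training code referenced in Section VI.2, is available in the linked repository.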
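
The Dataset Splits row reports FID and CLIP score on 30K-sample evaluation splits. Below is a minimal sketch of how such metrics could be accumulated with `torchmetrics`; this is an assumed standard evaluation loop, not the authors' evaluation code, and the function name is ours.

```python
# Hypothetical metric accumulation for a 30K-caption evaluation split.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048, normalize=True)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def update_metrics(real_images, fake_images, captions):
    """real_images/fake_images: float tensors in [0, 1], shape (N, 3, H, W)."""
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    # CLIPScore expects uint8 images in the [0, 255] range.
    clip_score.update((fake_images * 255).to(torch.uint8), captions)

# After iterating over the full 30K split:
# print(fid.compute(), clip_score.compute())
```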