Improved Techniques for Training Consistency Models
Authors: Yang Song, Prafulla Dhariwal
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. Combined with better hyperparameter tuning, these modifications enable consistency models to achieve FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet 64×64 respectively in a single sampling step. These scores mark a 3.5× and 4× improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.24 and 2.77 on these two datasets, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and other state-of-the-art generative models. |
| Researcher Affiliation | Industry | Yang Song & Prafulla Dhariwal, OpenAI {songyang,prafulla}@openai.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., repository link, explicit statement of code release) for open-source code related to the methodology. |
| Open Datasets | Yes | All models are trained on the CIFAR-10 dataset (Krizhevsky et al., 2014) without class labels. We observe similar improvements on other datasets, including ImageNet 64×64 (Deng et al., 2009). |
| Dataset Splits | No | The paper mentions training on CIFAR-10 and ImageNet 64x64, but does not explicitly provide details about training/validation/test splits, such as percentages or specific sample counts for a validation set. |
| Hardware Specification | Yes | All models are trained on a cluster of Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions using the RAdam optimizer and specific architectures (NCSN++, ADM) but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We train all models with the RAdam optimizer (Liu et al., 2019) using learning rate 0.0001. All CIFAR-10 models are trained for 400,000 iterations, whereas ImageNet 64×64 models are trained for 800,000 iterations. For CIFAR-10 models in Section 3, we use batch size 512 and EMA decay rate 0.9999 for the student network. For iCT and iCT-deep models in Table 2, we use batch size 1024 and EMA decay rate of 0.99993 for CIFAR-10 models, and batch size 4096 and EMA decay rate 0.99997 for ImageNet 64×64 models. We use a dropout rate of 0.3 for all consistency models on CIFAR-10. For ImageNet 64×64, we use a dropout rate of 0.2. |
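
Taking the Experiment Setup row at face value, the following is a minimal sketch of a training loop wired to those reported hyperparameters (iCT CIFAR-10 setting). The tiny stand-in network, random data, and placeholder loss are illustrative assumptions, not the authors' NCSN++/ADM implementation or consistency-training objective; only the optimizer choice, learning rate, batch size, EMA decay, dropout rate, and iteration counts come from the paper.

```python
# Hedged sketch of the reported training configuration (iCT CIFAR-10 setting).
# Only the hyperparameter values are taken from the paper; everything else
# (network, data, loss) is a placeholder assumption for illustration.
import copy
import torch
import torch.nn as nn

# Stand-in for the consistency model; the paper uses NCSN++ / ADM backbones.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.SiLU(),
    nn.Dropout(p=0.3),                 # 0.3 on CIFAR-10, 0.2 on ImageNet 64x64
    nn.Conv2d(64, 3, 3, padding=1),
)
ema_model = copy.deepcopy(model)       # EMA copy used for evaluation/sampling

optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)  # RAdam, lr 0.0001

EMA_DECAY = 0.99993     # iCT CIFAR-10; 0.99997 for ImageNet 64x64, 0.9999 in Sec. 3
BATCH_SIZE = 1024       # iCT CIFAR-10; 4096 for ImageNet 64x64, 512 in Sec. 3
TOTAL_ITERS = 400_000   # CIFAR-10; 800_000 for ImageNet 64x64

for step in range(TOTAL_ITERS):
    x = torch.randn(BATCH_SIZE, 3, 32, 32)   # placeholder for CIFAR-10 batches
    # Placeholder objective; the paper's consistency-training loss goes here.
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Exponential moving average of the weights at the reported decay rate.
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(EMA_DECAY).add_(p, alpha=1 - EMA_DECAY)
```

The EMA update is the standard per-step convex combination implied by a scalar decay rate; the paper does not specify further scheduling details, so this sketch applies it after every optimizer step.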