CCM: Real-Time Controllable Visual Content Creation Using Text-to-Image Consistency Models

Authors: Jie Xiao, Kai Zhu, Han Zhang, Zhiheng Liu, Yujun Shen, Zhantao Yang, Ruili Feng, Yu Liu, Xueyang Fu, Zheng-Jun Zha

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We quantitatively and qualitatively evaluate all strategies across various conditional controls, including sketch, HED, canny, depth, human pose, low-resolution image, and masked image, with the pre-trained text-to-image latent consistency models. ... Table 2. Quantitative comparison of different methods. ... Figure 1. Visual comparison of different strategies of adding controls.
Researcher Affiliation Collaboration 1University of Science and Technology of China, Hefei, China 2Alibaba Group 3Shanghai Jiao Tong University 4Ant Group.
Pseudocode No The paper contains mathematical formulations and architectural diagrams, but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not provide any specific links or explicit statements about the availability of open-source code for the described methodology.
Open Datasets Yes We train on a combination of public datasets, including ImageNet-21K (Russakovsky et al., 2015), WebVision (Li et al., 2017), and a filtered version of the LAION dataset (Schuhmann et al., 2022).
Dataset Splits No The paper specifies the datasets used and some training parameters, but it does not provide explicit training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification Yes This training process costs about 160 A100 GPU days. ... We report the number of function evaluations (NFEs) and measure the time consumption on a single A100 GPU.
Software Dependencies No The paper mentions various models and algorithms used (e.g., 'pre-trained edge detection model', 'canny edge detector', 'Midas model'), but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup Yes To train the foundational consistency model, we set θ⁻ = stopgrad(θ), N = 200, CFG = 5.0, and λ(tₙ) = 1.0 for all n ∼ U([1, N−1]). ... The batch size is 128 and the learning rate is 8e−6. The image resolution is 1024×1024. ... For each ControlNet, the total training process involves 100K training steps and the batch size is 32.
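As a quick reference, the reported hyperparameters can be collected into a single configuration. This is a minimal sketch for readability only; the key names are illustrative assumptions, not taken from the authors' (unreleased) code.

```python
# Hyperparameters reported in the paper's experiment setup.
# Key names are hypothetical; values are quoted from the paper.
consistency_model_config = {
    "target_params": "stopgrad(theta)",  # theta^- = stopgrad(theta)
    "num_timesteps_N": 200,              # N = 200
    "cfg_scale": 5.0,                    # classifier-free guidance
    "loss_weight_lambda": 1.0,           # lambda(t_n) = 1.0 for n ~ U([1, N-1])
    "batch_size": 128,
    "learning_rate": 8e-6,
    "resolution": (1024, 1024),
}

controlnet_config = {
    "training_steps": 100_000,  # 100K steps per ControlNet
    "batch_size": 32,
}
```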