Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Authors: Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. ... To evaluate the generation quality, we report the FID [58] in Table 2. We also use quantitative metrics to assess the controllability. We employed the following metrics for single-condition generation: SSIM (Structural Similarity) for Canny, HED, MLSD, and sketch conditions, mAP (mean Average Precision) based on OKS (Object Keypoint Similarity) for pose condition, MSE (Mean Squared Error) for depth map, mIoU (Mean Intersection over Union) for segmentation map, CLIP score for content condition. ... For ablation study, we fine-tune our model using a smaller dataset for resource consideration.
Researcher Affiliation | Collaboration | Shihao Zhao, The University of Hong Kong (shzhao@cs.hku.hk); Dongdong Chen, Microsoft (cddlyf@gmail.com); Yen-Chun Chen, Microsoft (yen-chun.chen@microsoft.com); Jianmin Bao, Microsoft (jianmin.bao@microsoft.com); Shaozhe Hao, The University of Hong Kong (szhao@cs.hku.hk); Lu Yuan, Microsoft (luyuan@microsoft.com); Kwan-Yee K. Wong, The University of Hong Kong (kykwong@cs.hku.hk)
Pseudocode | No | The paper describes its model architecture and processes in text and figures, but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.
Open Datasets | Yes | To fine-tune our model, we randomly sample 10 million text-image pairs from the LAION dataset [53] and fine-tune Uni-ControlNet for 1 epoch. ... For quantitative evaluation, we use the validation set of COCO2017 [57] at a resolution of 512 × 512.
Dataset Splits | Yes | For quantitative evaluation, we use the validation set of COCO2017 [57] at a resolution of 512 × 512. Since this set contains 5k images, and each image has multiple captions, we randomly select one caption per image, resulting in 5k generated images for our evaluation.
Hardware Specification | No | The paper mentions 'massive GPU resources' in the context of Composer [20], a related work, but does not specify the hardware used for its own experiments (e.g., specific GPU models, CPUs, or cluster details).
Software Dependencies | No | The paper mentions software components such as Stable Diffusion (SD), UNet, CLIP, DDIM, and the AdamW optimizer, but it does not provide specific version numbers for these or other software libraries (e.g., PyTorch, TensorFlow, Python version) used for replication.
Experiment Setup | Yes | To fine-tune our model, we randomly sample 10 million text-image pairs from the LAION dataset [53] and fine-tune Uni-ControlNet for 1 epoch. We use the AdamW optimizer [54] with a learning rate of 1 × 10^-5 and resize the input images and local condition maps to 512 × 512. As described, the local and global control adapters are fine-tuned separately by default. During inference, we merge the two adapters and adopt DDIM [55] for sampling, with the number of time steps set to 50 and the classifier-free guidance scale [56] set to 7.5. During training, the hyper-parameter λ in Equation 6 is fixed at 1. At inference time, when there is no text prompt, λ remains at 1, while when there is a text prompt, the value is adjusted to around 0.75, depending on the intended weighting between the text and the global condition.
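
To make the controllability protocol quoted under "Research Type" concrete, here is a minimal sketch of an SSIM-based check for the edge-style conditions (Canny, HED, MLSD, sketch): the condition map is re-extracted from the generated image and compared against the input map. The Canny extractor, its thresholds, and the BGR image format are assumptions for illustration; the paper does not spell out its exact preprocessing.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def canny_map(image_bgr, low=100, high=200):
    """Extract a Canny edge map; the thresholds here are assumed, not from the paper."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, low, high)

def controllability_ssim(condition_map, generated_bgr):
    """SSIM between the input condition map and the map re-extracted from the
    generated image; higher values mean the condition was followed more closely."""
    regenerated = canny_map(generated_bgr)
    return ssim(condition_map, regenerated, data_range=255)
```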
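The "Dataset Splits" row describes building the evaluation prompts by picking one random caption per COCO2017 validation image, giving 5k prompts in total. Below is a sketch of that selection, assuming the standard pycocotools API and annotation path; the paper does not report a random seed, so the one used here is arbitrary.

```python
import random
from pycocotools.coco import COCO

coco = COCO("annotations/captions_val2017.json")  # assumed annotation path
random.seed(0)                                    # assumed seed; not reported in the paper

prompts = {}
for img_id in coco.getImgIds():
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    captions = [a["caption"] for a in anns]
    prompts[img_id] = random.choice(captions)     # one caption per image -> 5k prompts
```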
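Finally, the "Experiment Setup" row lists the concrete sampling hyper-parameters. The snippet below simply collects them into a config object and encodes the stated rule for the global-condition weight λ (1 with no text prompt, about 0.75 with one); the names InferenceConfig and global_condition_weight are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass

@dataclass
class InferenceConfig:
    ddim_steps: int = 50         # DDIM sampling steps
    guidance_scale: float = 7.5  # classifier-free guidance scale
    resolution: int = 512        # images and local condition maps resized to 512 x 512

def global_condition_weight(has_text_prompt: bool) -> float:
    """Lambda from Equation 6: fixed at 1.0 during training and when no text prompt
    is given; lowered to roughly 0.75 when a text prompt is present, to balance the
    text against the global (content) condition."""
    return 0.75 if has_text_prompt else 1.0
```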