Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
Authors: Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. ... To evaluate the generation quality, we report the FID [58] in Table 2. We also use quantitative metrics to assess the controllability. We employed the following metrics for single-condition generation: SSIM (Structural Similarity) for Canny, HED, MLSD, and sketch conditions, mAP (mean Average Precision) based on OKS (Object Keypoint Similarity) for pose condition, MSE (Mean Squared Error) for depth map, mIoU (Mean Intersection over Union) for segmentation map, CLIP score for content condition. ... For ablation study, we fine-tune our model using a smaller dataset for resource consideration. (A hedged metric sketch follows the table.) |
| Researcher Affiliation | Collaboration | Shihao Zhao The University of Hong Kong shzhao@cs.hku.hk Dongdong Chen Microsoft cddlyf@gmail.com Yen-Chun Chen Microsoft yen-chun.chen@microsoft.com Jianmin Bao Microsoft jianmin.bao@microsoft.com Shaozhe Hao The University of Hong Kong szhao@cs.hku.hk Lu Yuan Microsoft luyuan@microsoft.com Kwan-Yee K. Wong The University of Hong Kong kykwong@cs.hku.hk |
| Pseudocode | No | The paper describes its model architecture and processes in text and figures, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet. |
| Open Datasets | Yes | To fine-tune our model, we randomly sample 10 million text-image pairs from the LAION dataset [53] and fine-tune Uni-ControlNet for 1 epoch. ... For quantitative evaluation, we use the validation set of COCO2017 [57] at a resolution of 512 × 512. |
| Dataset Splits | Yes | For quantitative evaluation, we use the validation set of COCO2017 [57] at a resolution of 512 × 512. Since this set contains 5k images, and each image has multiple captions, we randomly select one caption per image, resulting in 5k generated images for our evaluation. (A caption-sampling sketch follows the table.) |
| Hardware Specification | No | The paper mentions 'massive GPU resources' in the context of Composer [20], a related work, but does not specify the hardware used for its own experiments (e.g., specific GPU models, CPUs, or cluster details). |
| Software Dependencies | No | The paper mentions software components such as Stable Diffusion (SD), UNet, CLIP, DDIM, and the AdamW optimizer, but it does not provide specific version numbers for these or other software libraries (e.g., PyTorch, TensorFlow, Python version) used for replication. |
| Experiment Setup | Yes | To fine-tune our model, we randomly sample 10 million text-image pairs from the LAION dataset [53] and fine-tune Uni-ControlNet for 1 epoch. We use the AdamW optimizer [54] with a learning rate of 1 × 10⁻⁵ and resize the input images and local condition maps to 512 × 512. As described, the local and global control adapters are fine-tuned separately by default. During inference, we merge the two adapters and adopt DDIM [55] for sampling, with the number of time steps set to 50 and the classifier-free guidance scale [56] set to 7.5. During training, the hyper-parameter λ in Equation 6 is set to a fixed value of 1. At inference time, when there is no text prompt, λ remains at 1, while when there is a text prompt, the value is adjusted to around 0.75, depending on the intended weight between the text and global condition. (A hyper-parameter sketch follows the table.) |
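
The controllability metrics quoted in the Research Type row can be made concrete with a small sketch. It assumes a condition map (e.g. a Canny edge map or depth map) is re-extracted from each generated image with the same extractor used to build the input condition; `input_cond` and `regen_cond` are hypothetical arrays, not part of the authors' released evaluation code.

```python
# Minimal sketch of two of the quoted controllability metrics, assuming the
# condition maps are 2D uint8 arrays of the same shape.
import numpy as np
from skimage.metrics import structural_similarity


def ssim_controllability(input_cond: np.ndarray, regen_cond: np.ndarray) -> float:
    """SSIM between the input condition map and the one re-extracted from the
    generated image (used for Canny, HED, MLSD, and sketch conditions)."""
    return structural_similarity(input_cond, regen_cond, data_range=255)


def mse_controllability(input_depth: np.ndarray, regen_depth: np.ndarray) -> float:
    """MSE between input and re-extracted depth maps (lower is better)."""
    diff = input_depth.astype(np.float64) - regen_depth.astype(np.float64)
    return float(np.mean(diff ** 2))
```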
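The evaluation protocol in the Dataset Splits row (one randomly chosen caption per COCO2017 validation image, yielding 5k prompts) can be sketched with `pycocotools`; the annotation path and the fixed seed are assumptions, since the paper does not state how the captions were sampled.

```python
# Sketch of selecting one caption per COCO2017 validation image as evaluation prompts.
import random
from pycocotools.coco import COCO

coco = COCO("annotations/captions_val2017.json")  # assumed local COCO layout
random.seed(0)  # assumed seed; the paper does not specify one

prompts = {}
for img_id in coco.getImgIds():
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    prompts[img_id] = random.choice(anns)["caption"]  # one caption per image

print(len(prompts))  # 5000 prompts, one per validation image
```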
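The Experiment Setup row condenses into the hyper-parameter sketch below. The placeholder module stands in for the Uni-ControlNet control adapters; only the numeric values are taken from the paper.

```python
# Fine-tuning and sampling settings quoted in the Experiment Setup row.
import torch

adapters = torch.nn.Linear(8, 8)  # placeholder for the local + global control adapters
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-5)

config = {
    "resolution": 512,          # images and local condition maps resized to 512x512
    "epochs": 1,                # one epoch over 10M LAION text-image pairs
    "ddim_steps": 50,           # DDIM sampling steps at inference
    "cfg_scale": 7.5,           # classifier-free guidance scale
    "lambda_train": 1.0,        # global-condition weight lambda during training
    "lambda_with_prompt": 0.75, # approximate lambda when a text prompt is given
}
```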