Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
Authors: Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. ... To evaluate the generation quality, we report the FID [58] in Table 2. We also use quantitative metrics to assess the controllability. We employed the following metrics for single-condition generation: SSIM (Structural Similarity) for Canny, HED, MLSD, and sketch conditions, mAP (mean Average Precision) based on OKS (Object Keypoint Similarity) for pose condition, MSE (Mean Squared Error) for depth map, mIoU (Mean Intersection over Union) for segmentation map, CLIP score for content condition. ... For ablation study, we fine-tune our model using a smaller dataset for resource consideration. |
| Researcher Affiliation | Collaboration | Shihao Zhao The University of Hong Kong EMAIL Dongdong Chen Microsoft EMAIL Yen-Chun Chen Microsoft EMAIL Jianmin Bao Microsoft EMAIL Shaozhe Hao The University of Hong Kong EMAIL Lu Yuan Microsoft EMAIL Kwan-Yee K. Wong The University of Hong Kong EMAIL |
| Pseudocode | No | The paper describes its model architecture and processes in text and figures, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet. |
| Open Datasets | Yes | To fine-tune our model, we randomly sample 10 million text-image pairs from the LAION dataset [53] and fine-tune Uni-ControlNet for 1 epoch. ... For quantitative evaluation, we use the validation set of COCO2017 [57] at a resolution of 512 × 512. |
| Dataset Splits | Yes | For quantitative evaluation, we use the validation set of COCO2017 [57] at a resolution of 512 × 512. Since this set contains 5k images, and each image has multiple captions, we randomly select one caption per image resulting in 5k generated images for our evaluation. |
| Hardware Specification | No | The paper mentions 'massive GPU resources' in the context of Composer [20], a related work, but does not specify the hardware used for its own experiments (e.g., specific GPU models, CPUs, or cluster details). |
| Software Dependencies | No | The paper mentions software components such as Stable Diffusion (SD), UNet, CLIP, DDIM, and the AdamW optimizer, but it does not provide specific version numbers for these or other software libraries (e.g., PyTorch, TensorFlow, Python version) used for replication. |
| Experiment Setup | Yes | To fine-tune our model, we randomly sample 10 million text-image pairs from the LAION dataset [53] and fine-tune Uni-ControlNet for 1 epoch. We use the AdamW optimizer [54] with a learning rate of 1 × 10⁻⁵ and resize the input images and local condition maps to 512 × 512. As described, the local and global control adapters are fine-tuned separately by default. During inference, we merge the two adapters and adopt DDIM [55] for sampling, with the number of time steps set to 50 and the classifier-free guidance scale [56] set to 7.5. During training, the hyper-parameter λ in Equation 6 is fixed at 1. At inference time, when there is no text prompt, λ remains at 1, while when there is a text prompt, the value is adjusted to around 0.75, depending on the intended weight between the text and global condition. |
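
As a reading aid, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not taken from the authors' repository: the names `ReportedSetup`, `inference_lambda`, and `cfg_combine` are hypothetical, the value 0.75 is only the approximate figure quoted above ("around 0.75"), and `cfg_combine` shows the commonly used classifier-free guidance combination rather than the paper's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    train_pairs: int = 10_000_000   # text-image pairs sampled from LAION
    epochs: int = 1
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5
    resolution: int = 512           # images and local condition maps resized to 512 x 512
    ddim_steps: int = 50            # DDIM sampling time steps at inference
    guidance_scale: float = 7.5     # classifier-free guidance scale
    lambda_train: float = 1.0       # lambda in the paper's Equation 6 during training


def inference_lambda(has_text_prompt: bool) -> float:
    """Return the quoted inference-time lambda.

    The row says "around 0.75" with a text prompt, so 0.75 is an
    approximate default here, not an exact prescribed value.
    """
    return 0.75 if has_text_prompt else 1.0


def cfg_combine(eps_uncond: float, eps_cond: float, scale: float) -> float:
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    setup = ReportedSetup()
    print(setup)
    print(inference_lambda(has_text_prompt=True))
    print(cfg_combine(0.0, 1.0, setup.guidance_scale))
```

Hardware and library versions are absent from this sketch because, as the table notes, the paper does not report them.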