Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation

Authors: Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we delve into a comprehensive experimental analysis to validate the efficacy and superiority of the proposed method through ablation studies and application demonstrations. Subsequently, in Sec. 4.1, we put forth both quantitative and qualitative results, elucidating the comparative advantages of our approach.
Researcher Affiliation | Collaboration | Nanyang Technological University, South China University of Technology, University of Oxford, The University of Sydney, JD Explore Academy
Pseudocode | No | The paper does not contain a dedicated 'Pseudocode' or 'Algorithm' section, nor does it present any structured algorithm blocks.
Open Source Code | Yes | The code is released at https://mhh0318.github.io/cocktail/.
Open Datasets | Yes | All of our experiments are performed on the LAION-AESTHETICS-6.5 dataset, which contains about 600K image-text pairs with predicted aesthetics scores higher than 6.5.
Dataset Splits | No | The paper states that experiments are performed on the 'LAION-AESTHETICS-6.5 dataset' and evaluates on the 'COCO5k validation set' and 'COCO validation set'. However, it does not provide specific percentages or sample counts for the training, validation, and test splits of the LAION-AESTHETICS-6.5 dataset, which would be necessary to reproduce the data partitioning.
Hardware Specification | Yes | Trained for 20 epochs with a batch size of 64 on 4 NVIDIA 80GB A100 GPUs within 4 days.
Software Dependencies | No | The paper mentions Stable Diffusion v2.1 as the base model and the AdamW optimizer and DDIM sampler as components, but it does not provide specific version numbers for the underlying software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) required to reproduce the experiment.
Experiment Setup | Yes | gControlNet is adapted from the pretrained Stable Diffusion v2.1 in this paper and trained for 20 epochs with a batch size of 64 on 4 NVIDIA 80GB A100 GPUs within 4 days. We use the AdamW optimizer with a learning rate of 3.0e-05. All the training images in LAION-AESTHETICS-6.5 are first resized to 512 by the short side and then randomly cropped to 512×512. During inference, the sampler is DDIM, the sampling steps are 50, and the classifier-free guidance scale is 9.0 by default.
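
For readers who want to retrace these settings, below is a minimal sketch of the documented preprocessing and inference configuration. The paper does not name its software stack, so the use of torchvision and Hugging Face diffusers, the checkpoint ID stabilityai/stable-diffusion-2-1, and the placeholder prompt are illustrative assumptions, not the authors' released code.

```python
# Sketch of the reported preprocessing and inference settings.
# Library choices (torchvision, diffusers) and the checkpoint ID are
# assumptions for illustration; the paper only states the hyperparameters.
import torch
from torchvision import transforms
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Training preprocessing: resize the short side to 512, then random-crop 512x512.
train_transform = transforms.Compose([
    transforms.Resize(512),      # short side -> 512
    transforms.RandomCrop(512),  # 512 x 512 crop
    transforms.ToTensor(),
])

# Optimizer as reported: AdamW with lr = 3.0e-05 (model is a placeholder).
# optimizer = torch.optim.AdamW(model.parameters(), lr=3.0e-5)

# Inference: DDIM sampler, 50 steps, classifier-free guidance scale 9.0.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image = pipe(
    "a photo of a cat",  # placeholder prompt
    num_inference_steps=50,
    guidance_scale=9.0,
).images[0]
```

This reconstructs only the reported generation settings; the gControlNet branch and its training loop are part of the authors' release at the project page linked above.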