Non-confusing Generation of Customized Concepts in Diffusion Models

Authors: Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of CLIF in preventing the confusion of multi-customized concept generation.
Researcher Affiliation | Collaboration | Zhejiang University, Huawei Cloud Computing, Tsinghua University, Harbin Institute of Technology, Skywork AI (Singapore), Nanyang Technological University.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | Project page: https://clif-official.github.io/clif. The paper provides a link to a 'project page' but does not explicitly state that the source code for the methodology is available there.
Open Datasets | No | The paper states, 'We curate a dataset consisting of 18 representative characters,' and describes its construction from collected images and augmented data, but it provides no concrete access information (link, DOI, repository, or citation) indicating that this dataset is publicly available.
Dataset Splits | No | The paper describes the data generation process for evaluation and specifies image counts for single- and multi-concept scenarios (e.g., 'This yields a total of 1,000 images for single-concept.'), but it does not provide explicit training, validation, or test splits for the data used to train the models.
Hardware Specification | Yes | The process of tuning concept embeddings in the text encoder typically requires approximately 4-5 hours using four NVIDIA A100 GPUs, accounting for variations in data volume.
Software Dependencies | No | The paper mentions software tools like the 'CLIP-Score toolkit' and models like 'Stable Diffusion,' 'SAM,' and 'GPT-4,' but does not provide specific version numbers for these software dependencies (e.g., 'CLIP-Score toolkit version X.Y').
Experiment Setup | Yes | The implementation process for the text encoder involves fine-tuning it on augmented data, following a similar approach as CLIP, with a learning rate of 1e-4. ... As part of the LoRA tuning, we integrate the LoRA layer into the linear layer within all attention modules of the U-Net, utilizing a rank of r = 8. The Adam optimizer is utilized for both text embeddings and diffusion model parameters, with a learning rate of 2e-4. All experiments and evaluations make use of the DDPM with 50 sampling steps.
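
For readers reconstructing the reported setup, the following is a minimal sketch using the Hugging Face diffusers and peft libraries. The base checkpoint ID, the target module names, and the `<concept>` prompt placeholder are illustrative assumptions, not details confirmed by the paper; only the hyperparameters (LoRA rank 8, learning rates 1e-4 and 2e-4, Adam, DDPM with 50 steps) come from the setup above.

```python
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import LoraConfig

# Base checkpoint is an assumption; the paper builds on Stable Diffusion.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Freeze the U-Net, then inject LoRA (rank r = 8) into the linear layers of
# all attention modules; module names follow diffusers' attention naming.
pipe.unet.requires_grad_(False)
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)

# Adam optimizers as reported: lr 1e-4 for text-encoder fine-tuning,
# lr 2e-4 for the LoRA-injected diffusion parameters.
text_opt = torch.optim.Adam(pipe.text_encoder.parameters(), lr=1e-4)
lora_params = [p for p in pipe.unet.parameters() if p.requires_grad]
unet_opt = torch.optim.Adam(lora_params, lr=2e-4)

# Sampling with a DDPM scheduler and 50 steps, as stated in the setup.
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
image = pipe("a photo of a <concept>", num_inference_steps=50).images[0]
```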