Non-confusing Generation of Customized Concepts in Diffusion Models
Authors: Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of CLIF in preventing the confusion of multi-customized concept generation. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University, 3 Huawei Cloud Computing, 4 Tsinghua University, 5 Harbin Institute of Technology, 6 Skywork AI, Singapore, 7 Nanyang Technological University. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://clif-official.github.io/clif. The paper provides a link to a 'project page' but does not explicitly state that the source code for the methodology is available there. |
| Open Datasets | No | The paper states, 'We curate a dataset consisting of 18 representative characters,' and describes its construction from collected images and augmented data, but it does not provide any concrete access information (link, DOI, repository, or citation) for this dataset to be publicly available. |
| Dataset Splits | No | The paper describes the data generation process for evaluation and specifies image counts for single and multi-concept scenarios (e.g., 'This yields a total of 1,000 images for single-concept.'), but it does not provide explicit training, validation, or test dataset splits for the input data used to train the models. |
| Hardware Specification | Yes | The process of tuning concept embeddings in the text encoder typically requires approximately 4-5 hours using four NVIDIA A100 GPUs, accounting for variations in data volume. |
| Software Dependencies | No | The paper mentions software tools like 'CLIP-Score toolkit' and models like 'Stable Diffusion,' 'SAM,' and 'GPT-4,' but does not provide specific version numbers for these software dependencies (e.g., 'CLIP-Score toolkit version X.Y'). |
| Experiment Setup | Yes | The text encoder is fine-tuned on augmented data, following a similar approach to CLIP, with a learning rate of 1e-4. ... For LoRA tuning, the LoRA layers are integrated into the linear layers within all attention modules of the U-Net, using a rank of r = 8. The Adam optimizer is used for both the text embeddings and the diffusion model parameters, with a learning rate of 2e-4. All experiments and evaluations use DDPM with 50 sampling steps. (See the sketch below.) |
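The reported setup maps fairly directly onto common open-source tooling. Below is a minimal sketch, assuming Hugging Face `diffusers` and `peft`; the paper does not name its libraries, so the checkpoint ID (`runwayml/stable-diffusion-v1-5`), the attention-projection module names (`to_q`, `to_k`, `to_v`, `to_out.0`), and the placeholder prompt are assumptions, not details from the paper.

```python
# Sketch of the reported setup: LoRA (rank r = 8) on the linear layers of the
# U-Net attention modules, Adam at 2e-4 for the diffusion parameters and 1e-4
# for the text encoder, and DDPM sampling with 50 steps. Hypothetical
# reconstruction; module names and checkpoint are assumptions.
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import LoraConfig, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed base model, not named in the paper
).to("cuda")

# Use a DDPM scheduler, matching "DDPM with 50 sampling steps".
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# LoRA with rank r = 8 on the query/key/value/output projections, i.e. the
# linear layers inside every attention module of the U-Net.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet = get_peft_model(pipe.unet, lora_config)

# Adam with the two reported learning rates: 2e-4 for the LoRA (diffusion)
# parameters, 1e-4 for the text-encoder fine-tuning.
optimizer = torch.optim.Adam(
    [
        {"params": [p for p in pipe.unet.parameters() if p.requires_grad],
         "lr": 2e-4},
        {"params": pipe.text_encoder.parameters(), "lr": 1e-4},
    ]
)

# ... the training loop over the customized-concept data would go here ...

# Inference with 50 sampling steps, as reported. "<concept>" stands in for a
# learned concept token; the paper's actual prompts are not reproduced here.
image = pipe("a photo of <concept>", num_inference_steps=50).images[0]
```

Splitting the optimizer into parameter groups is one way to realize the two learning rates the review quotes; the paper may instead tune the text embeddings and the LoRA weights in separate stages.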