How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

Authors: Jiahua Dong, Wenqi Liang, Hongliu Li, Duzhen Zhang, Meng Cao, Henghui Ding, Salman H. Khan, Fahad Shahbaz Khan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments validate that our CIDM surpasses existing custom diffusion models. The source codes are available at https://github.com/JiahuaDong/CIFC.

5 Experiments

5.1 Experimental Setups
Benchmark Dataset: Motivated by [47, 11, 45], in this paper, we construct a new challenging concept-incremental learning (CIL) dataset including ten continuous text-guided concept customization tasks to illustrate the effectiveness of our model under the CIFC setting. In the CIL dataset, seven customization tasks have different object concepts (i.e., V1 dog, V2 duck toy, V3 cat, V4 backpack, V5 teddy bear, V7 dog and V9 cat) from [40, 22], and the remaining three tasks have different style concepts (i.e., V6, V8 and V10 styles) collected from websites. Considering the practicality of the CIFC setting, we use about 3–5 text-image pairs for each task. Particularly, we introduce some semantically similar concepts (e.g., V1 and V7 dogs, V3 and V9 cats), making the CIL dataset more challenging under the CIFC setting.
Implementation Details: We utilize two popular diffusion models, Stable Diffusion (SD-1.5) [38] and SDXL [33], as the pretrained models to conduct comparison experiments. For fair comparisons, we train all SOTA comparison methods and our model using the same backbone and Adam optimizer, where the initial learning rate is 1.0 × 10⁻³ to update textual embeddings and 1.0 × 10⁻⁴ to optimize the denoising UNet. For the low-rank matrices, we follow [11] to set r = 4. We empirically set γ1 = 0.1, γ2 = 1.0 in Eq. (2), α = 0.1 in Eq. (5), and the training steps are 800.
Evaluation Metrics: After learning the final concept customization task under the CIFC setting, we conduct both qualitative and quantitative evaluations on versatile generation tasks: single/multi-concept customization, custom image editing, and custom style transfer. For the quantitative evaluation, we follow [22] to use text-alignment (TA) and image-alignment (IA) as metrics. Specifically, for image-alignment (IA), we use the image encoder of CLIP [34] to evaluate the feature similarity between the synthesized image and the original sample. For text-alignment (TA), we utilize the text encoder of CLIP [34] to compute the text-image similarity between the synthesized image and its corresponding prompt.

5.2 Qualitative Comparisons
To verify the superiority of our model under the CIFC setting, we introduce extensive qualitative comparisons, including single/multi-concept customization (see Figs. 2–3), custom image editing (see Fig. 4), and custom style transfer (see Fig. 5).

5.3 Quantitative Comparisons
To analyze quantitative comparisons between our model and SOTA methods, we follow [11, 22, 45, 47] to introduce 20 evaluation prompts for each concept and generate 50 images for each evaluation prompt, resulting in a total of 1,000 synthesized images. The quantitative evaluation is then conducted on these 1,000 images. As shown in Tabs. 1–2, we can observe that our CIDM outperforms all comparison methods by 1.1%–8.0% in terms of image-alignment (IA) and by 1.2%–4.8% in terms of text-alignment (TA).

5.4 Ablation Studies
This subsection analyzes the effectiveness of each module in our model: elastic weight aggregation (EWA), context-controllable synthesis (CCS), task-specific knowledge (TSP) and task-shared knowledge (TSH) in the concept consolidation loss (CCL). Tab. 3 presents the ablation studies of single-concept customization in terms of IA. When compared with the Baseline, the performance of our model improves by 0.2%–3.3% in terms of IA after we add the proposed TSP, TSH and EWA modules. This demonstrates the effectiveness of our model in resolving the CIFC problem by addressing catastrophic forgetting and concept neglect.
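The IA and TA metrics quoted above are standard CLIP-similarity scores. The paper does not give implementation details, so the following is only a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name (openai/clip-vit-large-patch14) and the cosine-similarity aggregation are assumptions, not the authors' exact setup.

```python
# Hedged sketch of CLIP-based image-alignment (IA) and text-alignment (TA);
# the CLIP checkpoint and score aggregation are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def image_alignment(generated: Image.Image, reference: Image.Image) -> float:
    """IA: cosine similarity between CLIP image features of the
    synthesized image and an original concept sample."""
    inputs = processor(images=[generated, reference], return_tensors="pt").to(device)
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())

@torch.no_grad()
def text_alignment(generated: Image.Image, prompt: str) -> float:
    """TA: cosine similarity between the CLIP image feature of the
    synthesized image and the CLIP text feature of its prompt."""
    img_inputs = processor(images=generated, return_tensors="pt").to(device)
    txt_inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(device)
    img_feat = model.get_image_features(**img_inputs)
    txt_feat = model.get_text_features(**txt_inputs)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return float((img_feat * txt_feat).sum())
```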
Researcher Affiliation | Collaboration | Jiahua Dong¹, Wenqi Liang², Hongliu Li³, Duzhen Zhang¹, Meng Cao¹, Henghui Ding⁴, Salman Khan¹,⁵, Fahad Shahbaz Khan¹,⁶. ¹Mohamed bin Zayed University of Artificial Intelligence; ²Shenyang Institute of Automation, Chinese Academy of Sciences; ³The Hong Kong Polytechnic University; ⁴Institute of Big Data, Fudan University; ⁵Australian National University; ⁶Linköping University
Pseudocode | Yes | Algorithm 1: Algorithm Pipeline of The Proposed CIDM.
Open Source Code | Yes | The source codes are available at https://github.com/JiahuaDong/CIFC.
Open Datasets | No | The paper constructs a new dataset, the CIL dataset, but does not provide a direct link, DOI, or specific repository name for accessing it, nor does it cite a published paper that contains the dataset with proper bibliographic information.
Dataset Splits | No | The paper generates 1,000 synthesized images per concept for quantitative evaluation (20 evaluation prompts per concept, 50 images per prompt) and evaluates single/multi-concept customization, custom image editing, and custom style transfer. However, it does not provide train/validation/test splits, percentages, or sample counts, nor does it reference predefined splits with citations for reproducibility.
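The evaluation protocol noted here (20 prompts per concept, 50 images per prompt) can be reproduced with any text-to-image pipeline once a customized checkpoint is available. Below is a minimal sketch using the diffusers StableDiffusionPipeline; the base checkpoint, commented-out LoRA path, prompt list, and seeding scheme are placeholders rather than the authors' actual configuration.

```python
# Hedged sketch of the evaluation-image generation protocol:
# 20 prompts per concept x 50 images per prompt = 1,000 images per concept.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# pipe.load_lora_weights("path/to/customized_weights")  # hypothetical customized checkpoint

evaluation_prompts = [f"a photo of a V1 dog, variation {i}" for i in range(20)]  # placeholder prompts
os.makedirs("eval_images", exist_ok=True)

for p_idx, prompt in enumerate(evaluation_prompts):
    for s_idx in range(50):
        generator = torch.Generator("cuda").manual_seed(s_idx)  # fixed seed per sample
        image = pipe(prompt, generator=generator).images[0]
        image.save(f"eval_images/prompt{p_idx:02d}_img{s_idx:02d}.png")
```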
Hardware Specification | Yes | In this paper, we train our model on two NVIDIA RTX 4090 GPUs.
Software Dependencies | No | The paper mentions using the Adam optimizer and the CLIP text encoder, but does not specify version numbers for these or any other software libraries. It also refers to Stable Diffusion (SD-1.5) [38] and SDXL [33] only by citation, without explicit software version information.
Experiment Setup | Yes | For fair comparisons, we train all SOTA comparison methods and our model using the same backbone and Adam optimizer, where the initial learning rate is 1.0 × 10⁻³ to update textual embeddings and 1.0 × 10⁻⁴ to optimize the denoising UNet. For the low-rank matrices, we follow [11] to set r = 4. We empirically set γ1 = 0.1, γ2 = 1.0 in Eq. (2), α = 0.1 in Eq. (5), and the training steps are 800.
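The quoted setup maps onto a standard two-group Adam configuration with rank-4 low-rank (LoRA) adapters. A minimal sketch under that assumption follows; the parameter shapes and module wiring are illustrative placeholders and are not taken from the authors' released code.

```python
# Hedged sketch of the reported optimization setup: Adam with lr = 1.0e-3
# for the learnable textual embeddings and lr = 1.0e-4 for rank-4 LoRA
# matrices of the denoising UNet, trained for 800 steps.
import torch

text_embedding = torch.nn.Parameter(torch.randn(1, 768))      # placeholder concept token embedding
lora_down = torch.nn.Parameter(torch.randn(4, 320) * 0.01)    # placeholder rank-4 factors (r = 4)
lora_up = torch.nn.Parameter(torch.zeros(320, 4))

optimizer = torch.optim.Adam([
    {"params": [text_embedding], "lr": 1.0e-3},               # textual-embedding group
    {"params": [lora_down, lora_up], "lr": 1.0e-4},           # UNet low-rank-matrix group
])

hyperparams = {
    "lora_rank": 4,      # r = 4, following [11]
    "gamma1": 0.1,       # weight in Eq. (2)
    "gamma2": 1.0,       # weight in Eq. (2)
    "alpha": 0.1,        # weight in Eq. (5)
    "train_steps": 800,
}
```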