Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models

Authors: Sangwon Jang, Jaehyeong Jo, Kimin Lee, Sung Ju Hwang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our MuDI can produce high-quality personalized images without identity mixing, even for highly similar subjects as shown in Figure 1. Specifically, in human evaluation, MuDI obtains twice the success rate for personalizing multiple subjects without identity mixing over existing baselines and is preferred over 70% against the strongest baseline.
Researcher Affiliation | Collaboration | Sangwon Jang¹, Jaehyeong Jo¹, Kimin Lee¹, Sung Ju Hwang¹,² (equal contribution; equal advising); ¹KAIST, ²DeepAuto.ai
Pseudocode | Yes | We provide the detailed procedures of our method in Algorithm 1. ... The proposed inference method is summarized in Algorithm 2. ... We summarize the process of measuring the D&C score in Algorithm 3.
Open Source Code | Yes | Our project page is at https://mudi-t2i.github.io/. ... We have submitted our code for the experiments as supplementary materials.
Open Datasets | Yes | We collected images from the DreamBench dataset [42] and the CustomConcept101 dataset [21], consisting of diverse categories from animals to objects and scenes.
Dataset Splits | No | The paper describes creating a new dataset for evaluation and conducting human evaluations, but it does not specify explicit training/validation/test splits (e.g., percentages or counts for validation sets) for the data used to train the model.
Hardware Specification | Yes | After the preprocessing step, the training of MuDI takes almost the same duration as DreamBooth [42], taking about 90 minutes to personalize two subjects on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions 'Stable Diffusion XL (SDXL) [36]', 'LoRA [18]', the 'U-Net [41] module', and the 'AdamW optimizer [26]'. However, it does not provide specific version numbers for these software components or the surrounding libraries, which are necessary for a reproducible setup.
Experiment Setup | Yes | For all experiments, we use Stable Diffusion XL (SDXL) [36] as the pre-trained text-to-image diffusion model and employ a LoRA [18] with a rank of 32 for the U-Net [41] module. ... We determine the training iterations of Seg-Mix... For example... 1400 to 1600 training iterations... about 1200 iterations. We use a fixed augmentation probability of 0.3... To prevent subject overfitting, we use 1000 training iterations for DreamBooth. ... We use the AdamW optimizer [26] with β1 = 0.9, β2 = 0.999, weight decay of 0.0001, and a learning rate of 1e-4, following the setting of DreamBooth [42], and set the batch size to 2.
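
The hyperparameters quoted in the Experiment Setup row can be read as a concrete optimizer configuration. The following is a minimal PyTorch sketch of that setup, not the authors' released code: the lora_params placeholder and the variable names are illustrative assumptions, and only the reported values (AdamW with β1 = 0.9, β2 = 0.999, weight decay 0.0001, learning rate 1e-4, batch size 2, LoRA rank 32, Seg-Mix probability 0.3) come from the table above.

    import torch
    from torch import nn

    # Placeholder standing in for the rank-32 LoRA weights attached to the
    # SDXL U-Net; in practice these would come from the training framework,
    # not from this stub.
    lora_params = nn.ParameterList([nn.Parameter(torch.zeros(32, 768))])

    # AdamW with the hyperparameters reported in the Experiment Setup row.
    optimizer = torch.optim.AdamW(
        lora_params,
        lr=1e-4,                 # learning rate reported in the setup
        betas=(0.9, 0.999),      # beta1, beta2 as reported
        weight_decay=1e-4,       # weight decay of 0.0001
    )

    batch_size = 2               # reported batch size
    seg_mix_prob = 0.3           # fixed Seg-Mix augmentation probability
    train_iters = 1500           # illustrative; the paper reports roughly
                                 # 1200 to 1600 iterations depending on the subjects

In the paper's setting, the rank-32 LoRA is applied to the U-Net of SDXL, and the actual training loop would optimize the diffusion denoising loss on the personalization images; the snippet above only illustrates how the quoted optimizer and batch settings fit together.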