Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

Authors: Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive empirical study on the personalization of T2I diffusion models (see Fig. 1). We show that the proposed method improves upon baselines that use the regular fine-tuning objective or the prior preservation loss [10] when generating a custom subject with various visual attributes or in a known style of the pretrained model. Specifically, we show that our method lies on a superior Pareto frontier relative to the baselines (e.g., see Fig. 5a).
Researcher Affiliation | Collaboration | Kyungmin Lee (KAIST), Sangkyung Kwak (KAIST), Kihyuk Sohn (Meta Reality Labs), Jinwoo Shin (KAIST); {kyungmnlee, skkwak9806, jinwoos}@kaist.ac.kr, kihyuk.sohn@gmail.com
Pseudocode | Yes | The paper provides two algorithm listings (a PyTorch sketch of Algorithm 2 follows below):

Algorithm 1: Regular fine-tuning
Require: dataset D_ref, fine-tuning model ε_θ, learning rate η > 0
  while not converged do
    Sample (x, c) ~ D_ref
    Sample ε ~ N(0, I)
    Sample t ~ U(0, 1)
    z_t ← α_t x + σ_t ε
    L_DM(θ) ← ‖ε_θ(z_t; c, t) − ε‖₂²
    Update θ ← θ − η ∇_θ L_DM(θ)
  end while

Algorithm 2: Fine-tuning with DCO loss
Require: dataset D_ref, fine-tuning model ε_θ, pretrained model ε_ϕ, temperature β_t > 0, learning rate η > 0
  while not converged do
    Sample (x, c) ~ D_ref
    Sample ε ~ N(0, I)
    Sample t ~ U(0, 1)
    z_t ← α_t x + σ_t ε
    ℓ(θ) ← ‖ε_θ(z_t; c, t) − ε‖₂²
    ℓ(ϕ) ← ‖ε_ϕ(z_t; c, t) − ε‖₂² (no gradient)
    L_DCO(θ) ← −log σ(−β_t (ℓ(θ) − ℓ(ϕ)))
    Update θ ← θ − η ∇_θ L_DCO(θ)
  end while
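To make Algorithm 2 concrete, here is a minimal PyTorch sketch of one DCO loss evaluation. The call signature eps_theta(z_t, c, t), the scalar schedule coefficients alpha_t and sigma_t, the 4D latent shape, and the helper name dco_loss are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def dco_loss(eps_theta, eps_phi, x, c, t, alpha_t, sigma_t, beta_t=1000.0):
    """One DCO loss evaluation (Algorithm 2); eps_theta is the fine-tuned
    (trainable) noise predictor, eps_phi the frozen pretrained one."""
    eps = torch.randn_like(x)          # epsilon ~ N(0, I)
    z_t = alpha_t * x + sigma_t * eps  # forward diffusion of the clean latent x
    # Per-sample squared errors l(theta) and l(phi); assumes (B, C, H, W) latents.
    l_theta = F.mse_loss(eps_theta(z_t, c, t), eps, reduction="none").mean(dim=(1, 2, 3))
    with torch.no_grad():  # the pretrained term carries no gradient
        l_phi = F.mse_loss(eps_phi(z_t, c, t), eps, reduction="none").mean(dim=(1, 2, 3))
    # L_DCO = -log sigmoid(-beta_t * (l_theta - l_phi)) = softplus(beta_t * (l_theta - l_phi))
    return F.softplus(beta_t * (l_theta - l_phi)).mean()

Writing −log σ(−u) as softplus(u) is numerically safer for large β_t; minimizing it pushes ℓ(θ) below ℓ(ϕ), while the frozen pretrained term anchors the fine-tuned model to the pretrained distribution.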
Open Source Code | No | The authors state in the NeurIPS Paper Checklist (Question 5) that they do not provide open access to data and code at this time: 'We will make a related decision for this after the acceptance.'
Open Datasets | Yes | We use the DreamBooth dataset [10] for subject personalization, which contains 30 subjects, including pets and unique objects such as backpacks, dogs, plushies, etc. [...] Similarly, we use 10 images from the StyleDrop dataset [11] for style personalization, with examples presented in Fig. 12. License: the license for the DreamBooth dataset can be found here. Also, the attributes for the style images can be found in the StyleDrop [11] paper as well as here.
Dataset Splits | No | The paper mentions the DreamBooth and StyleDrop datasets but does not specify exact percentages or counts for training, validation, or test splits. It refers to 'fine-tuning' and 'evaluation' phases but not to the data partitioning.
Hardware Specification | Yes | Each training run is performed on a single A100 40GB GPU using a batch size of 1.
Software Dependencies | No | The paper states 'We use PyTorch and Huggingface Diffusers library for our codebase' but does not specify version numbers.
Experiment Setup | Yes | For all experiments, we fine-tune a LoRA of rank 32 and the textual embeddings using the Adam [52] optimizer with learning rates of 5e-5 and 5e-4, respectively. We use a constant β_t = 1000 for the DCO loss. [...] We fine-tune a LoRA of rank 64 using the Adam optimizer with a learning rate of 5e-5 and do not train the textual embedding. [...] We use the DDIM [41] scheduler with 50 steps and a CFG guidance scale of 7.5 throughout the experiments. (A minimal sampling-configuration sketch follows below.)
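As a rough illustration of the reported inference settings (50 DDIM steps, CFG scale 7.5), the following Diffusers sketch wires them up. The base model id and prompt are assumptions, since the row above does not name the exact checkpoint, and loading the fine-tuned LoRA weights is omitted.

import torch
from diffusers import DiffusionPipeline, DDIMScheduler

# Hypothetical base checkpoint; the paper fine-tunes a pretrained T2I model,
# but this row does not pin a specific one.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampler

image = pipe(
    "a photo of a backpack in the snow",  # illustrative prompt
    num_inference_steps=50,               # 50 DDIM steps
    guidance_scale=7.5,                   # CFG guidance scale 7.5
).images[0]

In recent Diffusers versions, the fine-tuned LoRA could then be attached via pipe.load_lora_weights before sampling.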