DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

Authors: Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, Wenwu Zhu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding.
Researcher Affiliation | Academia | 1 Department of Computer Science and Technology, Tsinghua University; 2 Beijing National Research Center for Information Science and Technology; 3 Lanzhou University
Pseudocode | No | The paper describes its methods through text and equations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/forchchch/DisenBooth
Open Datasets | Yes | We adopt the subject-driven text-to-image generation dataset DreamBench proposed by Ruiz et al. (2022), with images downloaded from Unsplash. This dataset contains 30 subjects, including unique objects like backpacks, stuffed animals, cats, etc.
Dataset Splits | No | The paper describes using a small set of images for finetuning (3-5 images per subject) and the DreamBench dataset for evaluation, but it does not specify explicit train/validation/test splits for DreamBench or for the finetuning process in general.
Hardware Specification | Yes | The finetuning process is conducted on one Tesla V100 with a batch size of 1, while the finetuning iterations are 3,000.
Software Dependencies | No | The paper mentions implementing based on 'Stable Diffusion 2-1' and using the 'AdamW' optimizer, but it does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | The learning rate is 1e-4 with the AdamW (Loshchilov & Hutter, 2018) optimizer. The finetuning process is conducted on one Tesla V100 with a batch size of 1, while the finetuning iterations are 3,000. As for the LoRA rank, we use r = 4 for all the experiments. We use λ2 = 0.01 for all our experiments. λ3 is a hyper-parameter set to 0.001 for all our experiments.
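The hyperparameters quoted in the Experiment Setup row can be collected into a minimal PyTorch sketch. This is not the authors' training code (that lives in their repository); the `LoRAAdapter` module here is a generic low-rank adapter stand-in, and the constant names are my own labels for the reported values.

```python
import torch

# Values reported in the paper's experiment setup (the names are mine).
LEARNING_RATE = 1e-4   # AdamW learning rate
LORA_RANK = 4          # r = 4 for all experiments
LAMBDA_2 = 0.01        # weight of the second loss term
LAMBDA_3 = 0.001       # weight of the third loss term
BATCH_SIZE = 1
FINETUNE_ITERS = 3000

class LoRAAdapter(torch.nn.Module):
    """Generic LoRA-style adapter: a rank-r down/up projection pair
    whose output is added to a frozen layer's output. Illustrative only."""

    def __init__(self, dim: int, rank: int = LORA_RANK):
        super().__init__()
        self.down = torch.nn.Linear(dim, rank, bias=False)
        self.up = torch.nn.Linear(rank, dim, bias=False)
        # Zero-init the up-projection so the adapter starts as a no-op,
        # the usual LoRA initialization.
        torch.nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

# Only the adapter parameters would be finetuned, with AdamW as reported.
adapter = LoRAAdapter(dim=64)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=LEARNING_RATE)
```

With the zero-initialized up-projection, the adapter's initial output is all zeros, so finetuning starts from the pretrained model's behavior and the low-rank update is learned over the 3,000 iterations.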