Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models

Authors: Wonguk Cho, Seokeon Choi, Debasmit Das, Matthias Reisser, Taesup Kim, Sungrack Yun, Fatih Porikli

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Quantitative and qualitative analyses demonstrate that our approach not only reduces training memory to levels as low as those required for inference but also maintains or improves personalization performance compared to existing methods. (Section 5, Experiments)
Researcher Affiliation | Collaboration | Qualcomm AI Research; Seoul National University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The code will be made available after an internal review process has been completed; it was not available at the time of submission.
Open Datasets | Yes | We use a total of 131 subjects for experiments, utilizing both the DreamBooth [5] and CustomConcept101 [7] datasets.
Dataset Splits | No | The paper describes using the DreamBooth and CustomConcept101 datasets for fine-tuning and evaluation but does not explicitly provide training/validation/test splits.
Hardware Specification | No | The paper repeatedly mentions GPU memory and computational resources and reports memory usage in GB (e.g., 3.88 GB of GPU memory usage), but it does not specify the particular GPU or CPU models used for the experiments.
Software Dependencies | No | The paper mentions the AdamW optimizer and the Stable Diffusion v2.1 diffusion model, but it does not specify any software dependencies with version numbers.
Experiment Setup | Yes | Following DreamBooth [5], we use a prior preservation loss with 1000 pre-generated class samples. LoRA [13] is applied to the cross- and self-attention layers and fine-tuned for 1000 steps. We use the AdamW optimizer with a learning rate of 1e-5 for full fine-tuning and 1e-4 for the others. Assuming a resource-constrained environment, we use a batch size of 1 and do not update the pre-trained text encoder, while text embeddings are pre-computed before fine-tuning.
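
The experiment-setup row above is specific enough to sketch in code. The sketch below assumes Hugging Face diffusers and peft, which the paper does not name; the model ID, the LoRA rank, and the target module names (to_q, to_k, to_v, to_out.0) are illustrative assumptions, while the optimizer, learning rate, step count, batch size, and frozen text encoder follow the quoted setup.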
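
```python
# Minimal sketch of the quoted fine-tuning setup, assuming Hugging Face
# diffusers + peft (the paper does not name an implementation framework).
# Model ID, LoRA rank, and target module names are illustrative assumptions;
# the learning rate, step count, batch size, and frozen text encoder follow
# the setup quoted in the table above.
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Stable Diffusion v2.1 UNet; the text encoder stays frozen and its text
# embeddings are assumed to be pre-computed before fine-tuning.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# LoRA on the query/key/value/output projections of the cross- and
# self-attention layers (module names follow diffusers' attention naming).
lora_config = LoraConfig(
    r=4,                # rank is not stated in the quoted setup; placeholder
    lora_alpha=4,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet.add_adapter(lora_config)  # diffusers' PEFT integration

# AdamW with lr 1e-4 for the LoRA parameters (the paper uses 1e-5 only for
# full fine-tuning); batch size 1 and 1000 optimization steps.
trainable_params = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
num_train_steps = 1000
batch_size = 1
```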
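
The prior preservation term (computed against the 1000 pre-generated class samples) and the denoising training loop itself are not shown; the sketch only encodes the configuration that the quoted setup states explicitly.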