Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models
Authors: Wonguk Cho, Seokeon Choi, Debasmit Das, Matthias Reisser, Taesup Kim, Sungrack Yun, Fatih Porikli
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative analyses demonstrate that our approach not only reduces training memory to levels as low as those required for inference but also maintains or improves personalization performance compared to existing methods. (Section 5: Experiments) |
| Researcher Affiliation | Collaboration | 1Qualcomm AI Research 2Seoul National University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states that the code will be made available after an internal review process is completed; it was not available at the time of submission. |
| Open Datasets | Yes | We use a total of 131 subjects for experiments, utilizing both the DreamBooth [5] and CustomConcept101 [7] datasets. |
| Dataset Splits | No | The paper describes using the DreamBooth and CustomConcept101 datasets for experiments and fine-tuning but does not explicitly provide training/validation/test splits. |
| Hardware Specification | No | The paper frequently mentions 'GPU memory' and 'computational resources' and discusses memory usage in GB (e.g., '3.88GB of GPU memory usage'), but it does not specify any particular GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer' and the 'Stable Diffusion v2.1 diffusion model', but it does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | Following DreamBooth [5], we use a prior preservation loss with 1000 pre-generated class samples. LoRA [13] is applied to the cross- and self-attention layers and fine-tuned for 1000 steps. We use the AdamW optimizer with a learning rate of 1e-5 for full fine-tuning and 1e-4 for the others. Assuming a resource-constrained environment, we use a batch size of 1 and do not update the pre-trained text encoder, while text embeddings are pre-computed before fine-tuning. |
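
For context, the experiment setup quoted above corresponds roughly to the configuration sketched below. This is a minimal sketch assuming the Hugging Face `diffusers` and `peft` libraries; the base model ID, LoRA rank/alpha, and attention target-module names are illustrative assumptions rather than values reported by the paper, while the frozen text encoder, AdamW optimizer, 1e-4 learning rate, batch size of 1, and 1000 fine-tuning steps follow the setup in the table.

```python
# Minimal sketch of the reported fine-tuning setup (assumptions noted in comments).
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

# Base model: Stable Diffusion v2.1, as mentioned in the summary.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

# The pre-trained text encoder is not updated; text embeddings can be
# pre-computed once before fine-tuning begins.
pipe.text_encoder.requires_grad_(False)

# Freeze the UNet base weights, then attach LoRA adapters to the
# cross- and self-attention projections (module names are an assumption).
pipe.unet.requires_grad_(False)
lora_config = LoraConfig(
    r=4,            # rank: assumed, not reported in the summary
    lora_alpha=4,   # assumed
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)

# Only the LoRA parameters remain trainable: AdamW with lr 1e-4, batch size 1,
# 1000 steps, and DreamBooth-style prior preservation with 1000 pre-generated
# class samples (training loop omitted here).
trainable_params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
```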