Decoupled Textual Embeddings for Customized Image Generation
Authors: Yufei Cai, Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hu Han, Wangmeng Zuo
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that the subject embedding obtained by our method can faithfully represent the target concept, while showing superior editability compared to the state-of-the-art methods. |
| Researcher Affiliation | Collaboration | (1) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) Harbin Institute of Technology; (4) Tomorrow Advancing Life; (5) Pengcheng Lab |
| Pseudocode | No | The paper includes a diagram of the framework (Figure 2) but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be available at https://github.com/PrototypeNx/DETEX. |
| Open Datasets | Yes | We conduct our experiment on the DreamBench (Ruiz et al. 2023) dataset. It consists of 30 subjects with different categories (e.g., animals, toys, and wearable items), and each subject has 4–7 images. |
| Dataset Splits | No | The paper mentions data used for training and evaluation but does not specify a distinct validation set with details on its size or use for hyperparameter tuning. |
| Hardware Specification | Yes | The training process is conducted on RTX 3090 using AdamW (Loshchilov and Hutter 2018) optimizer with a batch size of 4 for 600 steps. (See the training sketch below.) |
| Software Dependencies | No | The paper mentions 'Stable Diffusion v1-4' but does not provide specific version numbers for other software dependencies or libraries. |
| Experiment Setup | Yes | The learning rate is set as 1e-5, and the drop probability γ is set as 0.5. Our embedding mappers are implemented as the 3-layer MLPs, and each has a size of 6.7 MB. The cross-attention loss (Eqn. 4) is calculated at the resolution of 32×32, and we average the attention maps along the head dimension. During testing, we generate images with 50 DDIM (Song, Meng, and Ermon 2020) steps, and the scale of classifier-free guidance is 6. (See the sketches below.) |
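A minimal sketch of the training configuration reported in the rows above (AdamW, lr 1e-5, batch size 4, 600 steps, drop probability 0.5), assuming PyTorch. The mapper architecture is an assumption: with 768-dim CLIP text features (the dimension used by Stable Diffusion v1-4's text encoder), three 768×768 linear layers give ~1.77M parameters (~7 MB in fp32), close to the 6.7 MB reported. The dataloader and loss are placeholders, not the authors' released code.

```python
import torch
import torch.nn as nn

# Hypothetical 3-layer MLP embedding mapper (dimensions assumed, see lead-in).
mapper = nn.Sequential(
    nn.Linear(768, 768), nn.GELU(),
    nn.Linear(768, 768), nn.GELU(),
    nn.Linear(768, 768),
)

optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-5)  # reported optimizer and lr
drop_prob = 0.5  # reported drop probability (gamma)

for step in range(600):  # reported 600 training steps
    feats = torch.randn(4, 768)  # stand-in batch of 4 feature vectors; real dataloader omitted
    if torch.rand(()) < drop_prob:
        feats = torch.zeros_like(feats)  # illustrative use of gamma; what is dropped follows the paper
    emb = mapper(feats)
    loss = emb.pow(2).mean()  # placeholder; the paper's diffusion and attention losses are omitted
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```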
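The Experiment Setup row also notes that the cross-attention loss is computed on 32×32 attention maps averaged along the head dimension. A minimal sketch of that averaging step, assuming an attention tensor of shape (heads, queries, tokens) extracted from a UNet cross-attention layer; the extraction itself is omitted and all shapes and indices are assumptions.

```python
import torch

# Hypothetical cross-attention probabilities from a UNet layer at 32x32 resolution:
# shape (heads, queries, tokens) with queries = 32 * 32 and 77 text tokens.
attn = torch.rand(8, 32 * 32, 77)
attn = attn / attn.sum(dim=-1, keepdim=True)  # normalize rows into probabilities

token_idx = 5  # hypothetical index of the subject token in the prompt
attn_avg = attn.mean(dim=0)                   # average along the head dimension
subject_map = attn_avg[:, token_idx].reshape(32, 32)  # spatial attention to the subject token
```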
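Finally, a sketch of the reported sampling settings (50 DDIM steps, classifier-free guidance scale 6), assuming the Hugging Face diffusers API and the Stable Diffusion v1-4 checkpoint named in the paper; the prompt with a `<subject>` placeholder stands in for one containing the learned embedding.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load Stable Diffusion v1-4 and swap in a DDIM scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a photo of a <subject> on the beach",  # placeholder prompt
    num_inference_steps=50,  # 50 DDIM steps (Song, Meng, and Ermon 2020)
    guidance_scale=6.0,      # classifier-free guidance scale of 6
).images[0]
image.save("sample.png")
```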