MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
Authors: Jie Zhu, Yixiong Chen, Mingyu Ding, Ping Luo, Leye Wang, Jingdong Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the superiority of MoLE in the context of human-centric image generation compared to state-of-the-art, we construct two benchmarks and perform evaluations with diverse metrics and human studies. |
| Researcher Affiliation | Collaboration | Jie Zhu1,2, Yixiong Chen3, Mingyu Ding4, Ping Luo5, Leye Wang1,2, Jingdong Wang6. 1Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education, China; 2School of Computer Science, Peking University, Beijing, China; 3Johns Hopkins University; 4UC Berkeley; 5The University of Hong Kong; 6Baidu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such. Figure 5 presents a framework diagram but not pseudocode. |
| Open Source Code | Yes | Datasets, model, and code are released at project website. [...] Our objective is to establish transparent and open-source endeavors for the advancement of the community in generating more realistic human hands/faces. [...] We maintain transparency in our methods with open-source code and dataset composition, allowing for continuous improvement based on community feedback. |
| Open Datasets | Yes | Our human-centric dataset involves over one million high-quality images, containing three parts (See Sec 3.1). [...] Our collection is in compliance with the ethics and law as all images are collected from websites under Public Domain CC0 1.0 license that allows free use, redistribution, and adaptation for non-commercial purposes. [...] Datasets, model, and code are released at project website. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits for the main model training. While it mentions using the COCO val set for prompt construction and a 'val set' for filter training in Appendix A.3, it does not specify a validation split for the diffusion model training itself. |
| Hardware Specification | Yes | Finally, MoLE is resource-friendly and can be trained in a single A100 80G GPU. |
| Software Dependencies | No | The paper mentions software components like "Stable Diffusion v1.5" (which is a model, not a dependency in the usual sense for this question) and optimizers like "Lion optimizer [5]" and "AdamW optimizer", but does not specify version numbers for programming languages, libraries (e.g., PyTorch, TensorFlow), or other key software dependencies. |
| Experiment Setup | Yes | Stage 1: Fine-tuning on human-centric Dataset. We use Stable Diffusion v1.5 as base model and fine-tune the UNet (and text encoder) with a constant learning rate 2e-6. We set batch size to 64 and train with the Min-SNR weighting strategy [14]. The clip skip is 1 and we train the model for 300k steps using Lion optimizer [5]. [...] Stage 2: Low-rank Expert Generation. For face expert, we set batch size to 64 and train it 30k steps with a constant learning rate 2e-5. The rank is set to 256 and AdamW optimizer is used. For hand expert, we set batch size to 64. [...] train it 60k steps with a smaller learning rate 1e-5. The rank is also set to 256 and AdamW optimizer is used. [...] Stage 3: Mixture Adaptation. In this stage, we use the batch size 64 and employ AdamW optimizer. We use a constant learning rate 1e-5 and train for 50k steps. (The reported hyperparameters are summarized in the sketches below the table.) |
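
For readers attempting to reproduce the three-stage setup quoted above, the reported hyperparameters can be collected into one place. The sketch below is a minimal summary in a Python dataclass; the `StageConfig` container and its field names are hypothetical conveniences, not identifiers from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    """Hyperparameters as reported in the paper's Experiment Setup (hypothetical container)."""
    batch_size: int
    lr: float
    steps: int
    optimizer: str
    rank: Optional[int] = None  # low-rank expert rank, where applicable

# Stage 1: fine-tune SD v1.5 UNet (and text encoder) on the human-centric dataset.
stage1 = StageConfig(batch_size=64, lr=2e-6, steps=300_000, optimizer="Lion")

# Stage 2: train the face and hand low-rank experts separately.
stage2_face = StageConfig(batch_size=64, lr=2e-5, steps=30_000, optimizer="AdamW", rank=256)
stage2_hand = StageConfig(batch_size=64, lr=1e-5, steps=60_000, optimizer="AdamW", rank=256)

# Stage 3: mixture adaptation, combining the experts with the base model.
stage3 = StageConfig(batch_size=64, lr=1e-5, steps=50_000, optimizer="AdamW")
```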
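The "low-rank expert" trained in Stage 2 uses rank 256. Assuming the standard LoRA parameterization (W' = W + (alpha/r) BA, with the base weight frozen), a minimal PyTorch sketch of one such expert looks like the following; this illustrates generic low-rank adaptation, not the authors' released implementation, and MoLE's mixture/gating mechanism is omitted.

```python
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    """Generic LoRA-style low-rank adapter around a frozen linear layer.

    A minimal sketch: y = base(x) + (alpha / rank) * x A^T B^T.
    MoLE's actual experts and their soft mixture gating live in the
    authors' released code, not here.
    """
    def __init__(self, base_linear: nn.Linear, rank: int = 256, alpha: float = 256.0):
        super().__init__()
        self.base = base_linear
        self.base.requires_grad_(False)  # pretrained weight stays frozen
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # up-projection, zero init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base path plus the scaled low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Zero-initializing `B` makes the adapter a no-op at the start of training, so Stage 2 begins from the Stage 1 checkpoint's behavior, which is the usual LoRA design choice.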