Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
Authors: Jiajun Wang, Morteza Ghahremani Boozandani, Yitong Li, Björn Ommer, Christian Wachinger
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "4 Experimental Results" (section heading) |
| Researcher Affiliation | Academia | Jiajun Wang (1), Morteza Ghahremani (1,3), Yitong Li (1,3), Björn Ommer (2,3), and Christian Wachinger (1,3); (1) Lab for AI in Medical Imaging, Technical University of Munich (TUM), Germany; (2) CompVis @ LMU Munich, Germany; (3) Munich Center for Machine Learning (MCML), Germany |
| Pseudocode | Yes | Algorithm 1: Generation of the attention mask for PMSA (see the mask-generation sketch below the table) |
| Open Source Code | Yes | The project link and code are available at https://github.com/ai-med/StablePose. |
| Open Datasets | Yes | We assessed the performance of the proposed Stable-Pose as well as competing methods on five large-scale human-centric datasets including Human-Art [15], LAION-Human [16], UBC Fashion [43], DanceTrack [40], and DAVIS [27]. |
| Dataset Splits | Yes | On the Human-Art dataset, we trained all techniques, including ours, for 10 epochs to ensure a fair comparison. On the LAION-Human subset, we trained Stable-Pose, HumanSD [16], GLIGEN [18] and Uni-ControlNet [46] for 10 epochs, while we used released checkpoints from other techniques due to computational limitations. ... Human-Art: ... We adopt the same train-validation split as the authors suggested. ... LAION-Human: ... We randomly selected a subset of 200,000 images for training and 20,000 images for validation. |
| Hardware Specification | Yes | Training was conducted on two NVIDIA A100 GPUs. |
| Software Dependencies | No | Similar to previous work [44; 25; 46], we fine-tuned our model on SD version 1.5. We utilized the Adam [17] optimizer with a learning rate of 1 × 10⁻⁵. |
| Experiment Setup | Yes | We utilized the Adam [17] optimizer with a learning rate of 1 × 10⁻⁵. For our proposed PMSA ViT module, we adopted a depth of 2 and a patch size of 2, where coarse-to-fine pose masks were generated using two Gaussian filters, each with a sigma value of 3 but with differing kernel sizes of 23 and 13, respectively. ... In the pose-mask guided loss function, we set an α of 5 as the guidance factor. We also followed [44] to randomly replace text prompts with empty strings at a probability of 0.5... (Illustrative sketches of the guided loss and the prompt dropout also follow the table.) |
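
The paper's Algorithm 1 (generation of the attention mask for PMSA) is not reproduced in the table. As a rough illustration of the quoted configuration, here is a minimal sketch assuming the pose skeleton is rendered as a binary map, blurred by the two reported Gaussian filters (sigma 3, kernel sizes 23 and 13), and then pooled down to the ViT patch grid. The function name `pose_attention_masks` and the max-pool-then-threshold step are assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def pose_attention_masks(pose_map: torch.Tensor, patch_size: int = 2):
    """Sketch of coarse-to-fine attention-mask generation for PMSA.

    pose_map:   (B, 1, H, W) binary rendering of the pose skeleton in [0, 1].
    patch_size: ViT patch size (the paper reports a patch size of 2).
    Returns a list of boolean patch-level masks, coarse then fine.
    """
    masks = []
    # Two Gaussian filters with sigma 3 and kernel sizes 23 (coarse) and
    # 13 (fine), matching the training details quoted above.
    for kernel in (23, 13):
        blurred = gaussian_blur(pose_map, kernel_size=[kernel, kernel],
                                sigma=[3.0, 3.0])
        # Assumption: a patch is attendable if it overlaps the blurred
        # skeleton at all (max-pool over each patch, then threshold at 0).
        pooled = F.max_pool2d(blurred, kernel_size=patch_size)
        masks.append(pooled.flatten(1) > 0)  # (B, num_patches) boolean
    return masks
```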
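
The table quotes a guidance factor α = 5 for the pose-mask guided loss but not the loss itself. Below is a minimal sketch assuming the standard noise-prediction (ε) MSE objective of latent diffusion, with pose regions up-weighted by α; the `1 + alpha * mask` weighting and the name `pose_mask_guided_loss` are assumptions, not the authors' exact formulation.

```python
import torch

def pose_mask_guided_loss(eps_pred: torch.Tensor,
                          eps_true: torch.Tensor,
                          pose_mask: torch.Tensor,
                          alpha: float = 5.0) -> torch.Tensor:
    """Sketch of a pose-mask guided diffusion loss.

    eps_pred, eps_true: (B, C, H, W) predicted vs. sampled noise.
    pose_mask:          (B, 1, H, W) pose mask resized to latent resolution.
    alpha:              guidance factor (the paper reports alpha = 5).
    """
    per_pixel = (eps_pred - eps_true) ** 2
    # Assumed weighting: 1 away from the pose, 1 + alpha on it; the exact
    # formulation in the paper may differ.
    weight = 1.0 + alpha * pose_mask
    return (weight * per_pixel).mean()
```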
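
Finally, the quoted prompt replacement follows [44]: during training, each caption is independently replaced by the empty string with probability 0.5, the usual recipe for enabling classifier-free guidance. A one-line sketch; `maybe_drop_prompt` is a hypothetical helper name.

```python
import random

def maybe_drop_prompt(prompt: str, p_drop: float = 0.5) -> str:
    """With probability p_drop, replace the caption with the empty string,
    as reported in the experiment setup (p_drop = 0.5)."""
    return "" if random.random() < p_drop else prompt

# Typical use inside a dataset's __getitem__ (illustrative):
# caption = maybe_drop_prompt(sample["caption"])
```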