Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Authors: Jiajun Wang, Morteza Ghahremani Boozandani, Yitong Li, Björn Ommer, Christian Wachinger

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | 4 Experimental Results |
| Researcher Affiliation | Academia | Jiajun Wang¹, Morteza Ghahremani¹,³, Yitong Li¹,³, Björn Ommer²,³, and Christian Wachinger¹,³. ¹Lab for AI in Medical Imaging, Technical University of Munich (TUM), Germany; ²CompVis @ LMU Munich, Germany; ³Munich Center for Machine Learning (MCML), Germany |
| Pseudocode | Yes | Algorithm 1: Generation of the attention mask for PMSA |
| Open Source Code | Yes | The project link and code are available at https://github.com/ai-med/StablePose. |
| Open Datasets | Yes | We assessed the performance of the proposed Stable-Pose as well as competing methods on five large-scale human-centric datasets, including Human-Art [15], LAION-Human [16], UBC Fashion [43], DanceTrack [40], and DAVIS [27]. |
| Dataset Splits | Yes | On the Human-Art dataset, we trained all techniques, including ours, for 10 epochs to ensure a fair comparison. On the LAION-Human subset, we trained Stable-Pose, HumanSD [16], GLIGEN [18], and Uni-ControlNet [46] for 10 epochs, while we used released checkpoints from other techniques due to computational limitations. ... Human-Art: ... We adopt the same train-validation split as the authors suggested. ... LAION-Human: ... We randomly selected a subset of 200,000 images for training and 20,000 images for validation. |
| Hardware Specification | Yes | The training was executed using two NVIDIA A100 GPUs. ... Training was conducted on two NVIDIA A100 GPUs. |
| Software Dependencies | No | Similar to previous work [44; 25; 46], we fine-tuned our model on SD version 1.5. We utilized the Adam [17] optimizer with a learning rate of 1 × 10⁻⁵. |
| Experiment Setup | Yes | We utilized the Adam [17] optimizer with a learning rate of 1 × 10⁻⁵. For our proposed PMSA ViT module, we adopted a depth of 2 and a patch size of 2, where coarse-to-fine pose masks were generated using two Gaussian filters, each with a sigma value of 3 but with differing kernel sizes of 23 and 13, respectively. ... In the pose-mask guided loss function, we set an α of 5 as the guidance factor. We also followed [44] to randomly replace text prompts with empty strings at a probability of 0.5. |
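
The Pseudocode and Experiment Setup rows reference Algorithm 1, which generates the attention mask for PMSA from coarse-to-fine pose masks. Below is a minimal sketch of that mask generation under the reported hyperparameters (two Gaussian filters, each with sigma 3, kernel sizes 23 and 13). The function name, the simple greater-than-zero binarization, and the use of torchvision's gaussian_blur are assumptions for illustration, not the released implementation.

```python
# Sketch of coarse-to-fine pose-mask generation from the reported setup:
# two Gaussian filters with sigma = 3 and kernel sizes 23 and 13.
# Names and the binarization rule are illustrative, not from the repo.
import torch
import torchvision.transforms.functional as TF

def pose_masks(pose_map: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Blur a binary pose-skeleton map into a coarse and a fine mask.

    pose_map: (B, 1, H, W) tensor with 1s on the rendered skeleton.
    The larger kernel spreads the skeleton over a wider area (coarse
    mask); the smaller kernel yields a tighter fine mask.
    """
    coarse = TF.gaussian_blur(pose_map, kernel_size=23, sigma=3.0)
    fine = TF.gaussian_blur(pose_map, kernel_size=13, sigma=3.0)
    # Binarize: any pixel the blur touched joins the mask (assumed rule;
    # the exact mask construction follows Algorithm 1 in the paper).
    return (coarse > 0).float(), (fine > 0).float()

if __name__ == "__main__":
    pose = torch.zeros(1, 1, 64, 64)
    pose[..., 20:44, 32] = 1.0  # toy vertical "limb"
    coarse, fine = pose_masks(pose)
    print(coarse.sum().item(), fine.sum().item())  # coarse covers more pixels
```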
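The remaining setup details (Adam at 1 × 10⁻⁵, guidance factor α = 5 in the pose-mask guided loss, and text prompts randomly replaced with empty strings at probability 0.5) can be read together as one training step. The sketch below assumes a standard epsilon-prediction diffusion loss and an illustrative (1 + α · mask) per-pixel weighting; the paper's exact loss form, and every helper name here, are assumptions.

```python
# Illustrative training step under the reported settings. The
# (1 + alpha * mask) weighting is an assumed form of the pose-mask
# guided loss; add_noise and the model call signature are hypothetical.
import random
import torch

ALPHA = 5.0          # guidance factor from the experiment setup
PROMPT_DROP_P = 0.5  # probability of replacing the prompt with ""

def train_step(model, optimizer, latents, pose_mask, prompt):
    # Randomly drop the text prompt, as described in the setup.
    if random.random() < PROMPT_DROP_P:
        prompt = ""

    # Standard DDPM-style epsilon prediction on noised latents.
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.size(0),), device=latents.device)
    noised = model.add_noise(latents, noise, t)  # hypothetical helper
    pred = model(noised, t, prompt, pose_mask)   # hypothetical signature

    # Up-weight the reconstruction error inside the pose region
    # (assumed weighting; consult the paper for the exact loss).
    per_pixel = (pred - noise).pow(2)
    loss = ((1.0 + ALPHA * pose_mask) * per_pixel).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```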
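For the Dataset Splits row, the LAION-Human subset was drawn randomly (200,000 training and 20,000 validation images). A seeded index split like the following reproduces such a partition; the corpus size and the index-based selection mechanism are assumptions.

```python
import random

def split_indices(n_total: int, n_train: int = 200_000,
                  n_val: int = 20_000, seed: int = 0):
    # Draw disjoint train/val index sets; a fixed seed keeps the
    # split reproducible across runs (hypothetical mechanism).
    rng = random.Random(seed)
    picked = rng.sample(range(n_total), n_train + n_val)
    return picked[:n_train], picked[n_train:]

# Example: split a hypothetical 1M-image pool.
train_idx, val_idx = split_indices(1_000_000)
```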