Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
Authors: Jiajun Wang, Morteza Ghahremani Boozandani, Yitong Li, Björn Ommer, Christian Wachinger
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "4 Experimental Results" (section heading) |
| Researcher Affiliation | Academia | Jiajun Wang (1), Morteza Ghahremani (1,3), Yitong Li (1,3), Björn Ommer (2,3), and Christian Wachinger (1,3); (1) Lab for AI in Medical Imaging, Technical University of Munich (TUM), Germany; (2) CompVis @ LMU Munich, Germany; (3) Munich Center for Machine Learning (MCML), Germany |
| Pseudocode | Yes | Algorithm 1: Generation of the attention mask for PMSA (see the mask-generation sketch below the table) |
| Open Source Code | Yes | The project link and code are available at https://github.com/ai-med/StablePose. |
| Open Datasets | Yes | We assessed the performance of the proposed Stable-Pose as well as competing methods on five large-scale human-centric datasets including Human-Art [15], LAION-Human [16], UBC Fashion [43], DanceTrack [40], and DAVIS [27]. |
| Dataset Splits | Yes | On the Human-Art dataset, we trained all techniques, including ours, for 10 epochs to ensure a fair comparison. On the LAION-Human subset, we trained Stable-Pose, HumanSD [16], GLIGEN [18] and Uni-ControlNet [46] for 10 epochs, while we used released checkpoints from other techniques due to computational limitations. ... Human-Art: ... We adopt the same train-validation split as the authors suggested. ... LAION-Human: ... We randomly selected a subset of 200,000 images for training and 20,000 images for validation. |
| Hardware Specification | Yes | Training was conducted on two NVIDIA A100 GPUs. |
| Software Dependencies | No | Similar to previous work [44; 25; 46], we fine-tuned our model on SD version 1.5. We utilized the Adam [17] optimizer with a learning rate of 1 × 10⁻⁵. |
| Experiment Setup | Yes | We utilized the Adam [17] optimizer with a learning rate of 1 × 10⁻⁵. For our proposed PMSA ViT module, we adopted a depth of 2 and a patch size of 2, where coarse-to-fine pose masks were generated using two Gaussian filters, each with a sigma value of 3 but with differing kernel sizes of 23 and 13, respectively. ... In the pose-mask guided loss function, we set an α of 5 as the guidance factor. We also followed [44] to randomly replace text prompts with empty strings at a probability of 0.5... (Illustrative sketches of the guided loss and the prompt dropout also follow the table.) |
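
The paper's Algorithm 1 (generation of the attention mask for PMSA) is not reproduced in the table. As a rough illustration of the quoted configuration, here is a minimal sketch assuming the pose skeleton is rendered as a binary map, blurred by the two reported Gaussian filters (sigma 3, kernel sizes 23 and 13), and then pooled down to the ViT patch grid. The function name `pose_attention_masks` and the max-pool-then-threshold step are assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def pose_attention_masks(pose_map: torch.Tensor, patch_size: int = 2):
    """Sketch of coarse-to-fine attention-mask generation for PMSA.

    pose_map:   (B, 1, H, W) binary rendering of the pose skeleton in [0, 1].
    patch_size: ViT patch size (the paper reports a patch size of 2).
    Returns a list of boolean patch-level masks, coarse then fine.
    """
    masks = []
    # Two Gaussian filters with sigma 3 and kernel sizes 23 (coarse) and
    # 13 (fine), matching the training details quoted above.
    for kernel in (23, 13):
        blurred = gaussian_blur(pose_map, kernel_size=[kernel, kernel],
                                sigma=[3.0, 3.0])
        # Assumption: a patch is attendable if it overlaps the blurred
        # skeleton at all (max-pool over each patch, then threshold at 0).
        pooled = F.max_pool2d(blurred, kernel_size=patch_size)
        masks.append(pooled.flatten(1) > 0)  # (B, num_patches) boolean
    return masks
```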
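
The table quotes a guidance factor α = 5 for the pose-mask guided loss but not the loss itself. Below is a minimal sketch assuming the standard noise-prediction (ε) MSE objective of latent diffusion, with pose regions up-weighted by α; the `1 + alpha * mask` weighting and the name `pose_mask_guided_loss` are assumptions, not the authors' exact formulation.

```python
import torch

def pose_mask_guided_loss(eps_pred: torch.Tensor,
                          eps_true: torch.Tensor,
                          pose_mask: torch.Tensor,
                          alpha: float = 5.0) -> torch.Tensor:
    """Sketch of a pose-mask guided diffusion loss.

    eps_pred, eps_true: (B, C, H, W) predicted vs. sampled noise.
    pose_mask:          (B, 1, H, W) pose mask resized to latent resolution.
    alpha:              guidance factor (the paper reports alpha = 5).
    """
    per_pixel = (eps_pred - eps_true) ** 2
    # Assumed weighting: 1 away from the pose, 1 + alpha on it; the exact
    # formulation in the paper may differ.
    weight = 1.0 + alpha * pose_mask
    return (weight * per_pixel).mean()
```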
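
Finally, the quoted prompt replacement follows [44]: during training, each caption is independently replaced by the empty string with probability 0.5, the usual recipe for enabling classifier-free guidance. A one-line sketch; `maybe_drop_prompt` is a hypothetical helper name.

```python
import random

def maybe_drop_prompt(prompt: str, p_drop: float = 0.5) -> str:
    """With probability p_drop, replace the caption with the empty string,
    as reported in the experiment setup (p_drop = 0.5)."""
    return "" if random.random() < p_drop else prompt

# Typical use inside a dataset's __getitem__ (illustrative):
# caption = maybe_drop_prompt(sample["caption"])
```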