Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
IMAGPose: A Unified Conditional Framework for Pose-Guided Person Generation
Authors: Fei Shen, Jinhui Tang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiment results demonstrate the consistency and photorealism of our proposed IMAGPose under challenging user scenarios. The code and model will be available at https://github.com/muzishen/IMAGPose. |
| Researcher Affiliation | Academia | Fei Shen, Jinhui Tang, Nanjing University of Science and Technology |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and model will be available at https://github.com/muzishen/IMAGPose. |
| Open Datasets | Yes | We conducted experiments on the DeepFashion dataset [21], which consists of 52,712 high-resolution images of fashion models, and the Market-1501 dataset [54], which includes 32,668 low-resolution images with diverse backgrounds, viewpoints, and lighting conditions. |
| Dataset Splits | Yes | We extracted the skeletons using OpenPose [3] and followed the dataset splits provided by [1]. It's important to note that the person IDs of the training and testing sets do not overlap for both datasets. |
| Hardware Specification | Yes | We conduct experiments on 8 NVIDIA V100 GPUs. |
| Software Dependencies | Yes | We use the pre-trained Stable Diffusion V1.5 and modified the first convolutional layer to accommodate additional conditions. Unless otherwise specified, we use Dinov2-G/14 as the image encoder. ... For the pose condition, we introduced a pose encoder identical to ControlNet for injection after the first convolutional layer. |
| Experiment Setup | Yes | Our configuration can be summarized as follows: (a) We use the pre-trained Stable Diffusion V1.5 and modified the first convolutional layer to accommodate additional conditions. Unless otherwise specified, we use Dinov2-G/14 as the image encoder. In the tokenizer layer, both the kernel size and stride of the 2D convolution are 16, and the dimensions of the input and output channels are 4 and 768, respectively. (b) Following [1, 36], we train our model on the DeepFashion dataset with sizes of 256 × 176 and 512 × 352. For the Market-1501 dataset, we used images of size 128 × 64. (c) In the masking strategy, we defaulted to randomly occluding 1-4 images. (d) The model is trained for 300k steps using the AdamW optimizer with a learning rate of 5e-5. Each batch size is 4, and a linear noise schedule of 1000 time steps is applied. (e) In the inference stage, we used a DDIM sampler with 20 steps, and set w to 2.0 in the guidance scale. |
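
The reported setup can be collected into a small configuration object for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `ExperimentSetup`, the helper functions, the 64 × 64 latent size in the usage example, and the evenly spaced DDIM timestep selection are all assumptions made for illustration. Only the numeric values come from the Experiment Setup row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    train_steps: int = 300_000        # "300k steps"
    learning_rate: float = 5e-5       # AdamW learning rate (assumed reading: 5e-5)
    batch_size: int = 4
    noise_timesteps: int = 1000       # linear noise schedule length
    ddim_steps: int = 20              # DDIM sampler steps at inference
    guidance_scale: float = 2.0       # "w" in the guidance scale
    tokenizer_kernel: int = 16        # 2D convolution kernel size
    tokenizer_stride: int = 16        # 2D convolution stride
    tokenizer_in_channels: int = 4
    tokenizer_out_channels: int = 768
    min_masked_images: int = 1        # masking strategy: occlude 1-4 images
    max_masked_images: int = 4


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output length along one axis of an unpadded, undilated convolution."""
    return (size - kernel) // stride + 1


def token_grid(latent_h: int, latent_w: int, cfg: ExperimentSetup) -> tuple:
    """Token grid (rows, cols) produced by the tokenizer convolution.

    Assumes no padding; the source does not state the padding used.
    """
    rows = conv_output_size(latent_h, cfg.tokenizer_kernel, cfg.tokenizer_stride)
    cols = conv_output_size(latent_w, cfg.tokenizer_kernel, cfg.tokenizer_stride)
    return rows, cols


def ddim_timesteps(cfg: ExperimentSetup) -> list:
    """Evenly spaced subset of training timesteps in descending order.

    Even spacing is an assumption; the source only gives the step count.
    """
    step = cfg.noise_timesteps // cfg.ddim_steps
    return list(range(cfg.noise_timesteps - step, -1, -step))


cfg = ExperimentSetup()
# Hypothetical 64 x 64 latent, chosen only to illustrate the arithmetic.
print(token_grid(64, 64, cfg))
print(len(ddim_timesteps(cfg)))
```

With a kernel size equal to the stride, the tokenizer splits the latent into non-overlapping 16 × 16 patches, each projected from 4 to 768 channels, so the token count depends only on the latent resolution.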