IMAGPose: A Unified Conditional Framework for Pose-Guided Person Generation
Authors: Fei Shen, Jinhui Tang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiment results demonstrate the consistency and photorealism of our proposed IMAGPose under challenging user scenarios. The code and model will be available at https://github.com/muzishen/IMAGPose. |
| Researcher Affiliation | Academia | Fei Shen, Jinhui Tang; Nanjing University of Science and Technology; {feishen, jinhuitang}@njust.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and model will be available at https://github.com/muzishen/IMAGPose. |
| Open Datasets | Yes | We conducted experiments on the DeepFashion dataset [21], which consists of 52,712 high-resolution images of fashion models, and the Market-1501 dataset [54], which includes 32,668 low-resolution images with diverse backgrounds, viewpoints, and lighting conditions. |
| Dataset Splits | Yes | We extracted the skeletons using OpenPose [3] and followed the dataset splits provided by [1]. It is important to note that the person IDs of the training and testing sets do not overlap in either dataset. |
| Hardware Specification | Yes | We conduct experiments on 8 NVIDIA V100 GPUs. |
| Software Dependencies | Yes | We use the pre-trained Stable Diffusion V1.5 and modify the first convolutional layer to accommodate additional conditions. Unless otherwise specified, we use DINOv2-G/14 as the image encoder. ... For the pose condition, we introduced a pose encoder identical to ControlNet for injection after the first convolutional layer. (A sketch of this first-layer modification follows the table.) |
| Experiment Setup | Yes | Our configuration can be summarized as follows (hedged code sketches for items (a)-(e) follow this table): (a) We use the pre-trained Stable Diffusion V1.5 and modify the first convolutional layer to accommodate additional conditions. Unless otherwise specified, we use DINOv2-G/14 as the image encoder. In the tokenizer layer, both the kernel size and stride of the 2D convolution are 16, and the input and output channel dimensions are 4 and 768, respectively. (b) Following [1, 36], we train our model on the DeepFashion dataset at sizes of 256 × 176 and 512 × 352. For the Market-1501 dataset, we use images of size 128 × 64. (c) In the masking strategy, we default to randomly occluding 1-4 images. (d) The model is trained for 300k steps using the AdamW optimizer with a learning rate of 5e-5. Each batch size is 4, and a linear noise schedule of 1000 time steps is applied. (e) In the inference stage, we use a DDIM sampler with 20 steps and set the guidance scale w to 2.0. |
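
The first-layer modification described in the Software Dependencies row and in setting (a) is not published in code form. Below is a minimal sketch, assuming the standard diffusers checkpoint layout and a hypothetical `EXTRA_CHANNELS` count, of how the input convolution of Stable Diffusion V1.5 can be widened to accept additional condition channels:

```python
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

# Load the pre-trained Stable Diffusion V1.5 UNet.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

EXTRA_CHANNELS = 5  # hypothetical: the paper does not state the exact count

old_conv = unet.conv_in  # Conv2d(4, 320, kernel_size=3, padding=1)
new_conv = nn.Conv2d(
    old_conv.in_channels + EXTRA_CHANNELS,
    old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
)
with torch.no_grad():
    new_conv.weight.zero_()
    # Copy pretrained weights for the original 4 latent channels; the extra
    # condition channels start at zero so training begins from the
    # unmodified SD behaviour (a common, but here assumed, initialization).
    new_conv.weight[:, : old_conv.in_channels] = old_conv.weight
    new_conv.bias.copy_(old_conv.bias)
unet.conv_in = new_conv
```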
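The tokenizer layer in setting (a) is fully specified by its dimensions (kernel size and stride 16, channels 4 to 768), so it can be written directly as a patch-embedding-style convolution; variable names in this sketch are ours:

```python
import torch
import torch.nn as nn

# Tokenizer layer per setting (a): kernel size and stride of 16,
# mapping 4 latent channels to 768-dimensional tokens.
tokenizer = nn.Conv2d(in_channels=4, out_channels=768, kernel_size=16, stride=16)

latents = torch.randn(1, 4, 64, 64)         # example VAE latents
tokens = tokenizer(latents)                 # (1, 768, 4, 4)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 16, 768) token sequence
```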
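Setting (c) only states the occlusion count; here is a sketch of one plausible per-slot masking routine, where the slot-level granularity of the mask is our assumption:

```python
import random
import torch

def random_occlusion_mask(num_slots: int = 4) -> torch.Tensor:
    """Return a binary mask over image slots; 0 marks an occluded slot.

    Randomly occludes between 1 and num_slots slots, as in setting (c).
    """
    k = random.randint(1, num_slots)               # how many slots to occlude
    occluded = random.sample(range(num_slots), k)  # which slots
    mask = torch.ones(num_slots)
    mask[occluded] = 0.0
    return mask
```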
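Setting (d) maps directly onto standard PyTorch/diffusers components; a sketch, reusing the widened `unet` from the first sketch:

```python
import torch
from diffusers import DDPMScheduler

# Linear noise schedule over 1000 timesteps, per setting (d).
noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule="linear")

# AdamW with learning rate 5e-5, per setting (d).
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-5)
```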
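Finally, setting (e) corresponds to DDIM sampling with classifier-free guidance. A minimal sketch using the common diffusers convention for the guidance weight; the condition tensors are placeholders, and the stock UNet is loaded here so the snippet is self-contained:

```python
import torch
from diffusers import DDIMScheduler, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)
scheduler.set_timesteps(20)  # 20 DDIM steps, per setting (e)
w = 2.0                      # guidance scale, per setting (e)

cond = torch.randn(1, 77, 768)      # placeholder for the real image/pose tokens
null_cond = torch.zeros_like(cond)  # unconditional branch

latents = torch.randn(1, 4, 64, 64)
with torch.no_grad():
    for t in scheduler.timesteps:
        # Predict noise with and without the conditions, then combine
        # via classifier-free guidance.
        eps_uncond = unet(latents, t, encoder_hidden_states=null_cond).sample
        eps_cond = unet(latents, t, encoder_hidden_states=cond).sample
        eps = eps_uncond + w * (eps_cond - eps_uncond)
        latents = scheduler.step(eps, t, latents).prev_sample
```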