IMAGPose: A Unified Conditional Framework for Pose-Guided Person Generation

Authors: Fei Shen, Jinhui Tang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiment results demonstrate the consistency and photorealism of our proposed IMAGPose under challenging user scenarios. The code and model will be available at https://github.com/muzishen/IMAGPose."
Researcher Affiliation | Academia | "Fei Shen, Jinhui Tang, Nanjing University of Science and Technology, {feishen, jinhuitang}@njust.edu.cn"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code and model will be available at https://github.com/muzishen/IMAGPose."
Open Datasets | Yes | "We conducted experiments on the DeepFashion dataset [21], which consists of 52,712 high-resolution images of fashion models, and the Market-1501 dataset [54], which includes 32,668 low-resolution images with diverse backgrounds, viewpoints, and lighting conditions."
Dataset Splits | Yes | "We extracted the skeletons using OpenPose [3] and followed the dataset splits provided by [1]. It is important to note that the person IDs of the training and testing sets do not overlap for both datasets."
Hardware Specification | Yes | "We conduct experiments on 8 NVIDIA V100 GPUs."
Software Dependencies | Yes | "We use the pre-trained Stable Diffusion V1.5 and modified the first convolutional layer to accommodate additional conditions. Unless otherwise specified, we use DINOv2-G/14 as the image encoder. ... For the pose condition, we introduced a pose encoder identical to ControlNet for injection after the first convolutional layer." (See the conditioning sketch after this table.)
Experiment Setup | Yes | "Our configuration can be summarized as follows: (a) We use the pre-trained Stable Diffusion V1.5 and modified the first convolutional layer to accommodate additional conditions. Unless otherwise specified, we use DINOv2-G/14 as the image encoder. In the tokenizer layer, both the kernel size and stride of the 2D convolution are 16, and the dimensions of the input and output channels are 4 and 768, respectively. (b) Following [1, 36], we train our model on the DeepFashion dataset with sizes of 256 × 176 and 512 × 352. For the Market-1501 dataset, we used images of size 128 × 64. (c) In the masking strategy, we defaulted to randomly occluding 1-4 images. (d) The model is trained for 300k steps using the AdamW optimizer with a learning rate of 5e-5. The batch size is 4, and a linear noise schedule of 1000 time steps is applied. (e) In the inference stage, we used a DDIM sampler with 20 steps and set the guidance scale w to 2.0." (See the configuration sketch after this table.)
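
The conditioning setup quoted in the Software Dependencies row (a modified first convolution in Stable Diffusion V1.5, with a ControlNet-style pose encoder injected after it) can be illustrated in PyTorch. This is a minimal sketch assuming the extra conditions are concatenated along the channel dimension; expand_conv_in, the chosen channel count, and the zero-initialization of the new channels are illustrative choices, not the authors' released code.

import torch
import torch.nn as nn

def expand_conv_in(conv_in: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Widen a UNet's first convolution so it also accepts condition channels."""
    new_conv = nn.Conv2d(
        conv_in.in_channels + extra_channels,
        conv_in.out_channels,
        kernel_size=conv_in.kernel_size,
        stride=conv_in.stride,
        padding=conv_in.padding,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        # Keep the pretrained weights for the original latent channels and
        # leave the newly added condition channels zero-initialized.
        new_conv.weight[:, : conv_in.in_channels] = conv_in.weight
        new_conv.bias.copy_(conv_in.bias)
    return new_conv

# Stable Diffusion V1.5's first convolution maps 4 latent channels to 320 features.
sd_conv_in = nn.Conv2d(4, 320, kernel_size=3, padding=1)
conv_in = expand_conv_in(sd_conv_in, extra_channels=4)  # channel count is illustrative

# Tokenizer layer from the setup: kernel size and stride 16, 4 -> 768 channels.
tokenizer = nn.Conv2d(4, 768, kernel_size=16, stride=16)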
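
Similarly, the numbers quoted in the Experiment Setup row can be collected into one configuration sketch. Only the numeric values come from the quote; the field names, the stand-in module, and the optimizer wiring are illustrative assumptions.

from dataclasses import dataclass
from typing import Tuple
import torch

@dataclass
class IMAGPoseTrainConfig:
    """Hyperparameters from the quoted setup; field names are illustrative."""
    deepfashion_sizes: Tuple[Tuple[int, int], ...] = ((256, 176), (512, 352))  # (b)
    market1501_size: Tuple[int, int] = (128, 64)                               # (b)
    masked_images: Tuple[int, int] = (1, 4)   # (c) randomly occlude 1-4 images
    train_steps: int = 300_000                # (d)
    learning_rate: float = 5e-5               # (d) AdamW
    batch_size: int = 4                       # (d)
    noise_timesteps: int = 1_000              # (d) linear noise schedule
    ddim_steps: int = 20                      # (e)
    guidance_scale: float = 2.0               # (e) w

cfg = IMAGPoseTrainConfig()
unet = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)  # stand-in for the denoising UNet
optimizer = torch.optim.AdamW(unet.parameters(), lr=cfg.learning_rate)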