Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models

Authors: Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, Wei Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.
Researcher Affiliation | Industry | Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, Wei Yang. Tencent AI Lab. {ffeishen, huye, junejzhang, xvencewang, haroldhan, willyang}@tencent.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and model will be available at https://github.com/tencent-ailab/PCDMs.
Open Datasets | Yes | We carry out experiments on DeepFashion (Liu et al., 2016), which consists of 52,712 high-resolution images of fashion models, and Market-1501 (Zheng et al., 2015), which includes 32,668 low-resolution images with diverse backgrounds, viewpoints, and lighting conditions.
Dataset Splits | Yes | We follow the dataset splits provided by (Bhunia et al., 2023). Note that the person IDs of the training and testing sets do not overlap for either dataset.
Hardware Specification | Yes | We perform our experiments on 8 NVIDIA V100 GPUs.
Software Dependencies | Yes | For the prior conditional diffusion model, we employ OpenCLIP ViT-H/14 as the CLIP image encoder... For the inpainting and refining models, we use DINOv2-G/14 as the image encoder. We leverage the pretrained Stable Diffusion v2.1, modifying the first convolution layer to adapt additional conditions. (One possible form of this layer change is sketched after the table.)
Experiment Setup | Yes | Our configurations can be summarized as follows: (1) the transformer of the prior model has 20 transformer blocks with a width of 2,048. For the inpainting model and refining model, we use the pretrained Stable Diffusion v2.1 and modify the first convolution layer to adapt additional conditions. (2) We employ the AdamW optimizer with a fixed learning rate of 1e-4 in all stages. (3) Following (Ren et al., 2022; Bhunia et al., 2023), we train our models using images of sizes 256×176 and 512×352 for the DeepFashion dataset. For the Market-1501 dataset, we utilize images of size 128×64. Please refer to Section B of the Appendix for more detail. (Supplementary Section B): We utilize the AdamW optimizer with a consistent learning rate of 1e-4 across all stages. The probability of random dropout for condition c is set at 10%. ... The model is trained for 100k iterations with a batch size of 256, using a cosine noising schedule (Nichol & Dhariwal, 2021) with 1000 timesteps. For the inpainting and refining models... these models are trained for 200k and 100k iterations, respectively, each with a batch size of 128, and a linear noise schedule with 1000 timesteps is applied. In the inference stage, we use the DDIM sampler with 20 steps and set the guidance scale w to 2.0 for PCDMs on all stages. (The second sketch after the table transcribes these hyperparameters.)
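
The Software Dependencies row notes that PCDMs take the pretrained Stable Diffusion v2.1 UNet and modify its first convolution to accept additional conditions. The paper does not show this change, so below is a minimal sketch assuming the Hugging Face diffusers UNet; the number of extra condition channels (5) and the zero initialization are illustrative assumptions, not the authors' implementation.

```python
import torch
from diffusers import UNet2DConditionModel

# Sketch only: widen Stable Diffusion v2.1's first convolution so extra
# condition maps can be concatenated with the 4 latent input channels.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

extra_channels = 5  # ASSUMPTION: depends on how source/pose conditions are packed
old = unet.conv_in  # Conv2d(4, 320, kernel_size=(3, 3), padding=(1, 1))
new = torch.nn.Conv2d(
    old.in_channels + extra_channels,
    old.out_channels,
    kernel_size=old.kernel_size,
    padding=old.padding,
)

# Reuse pretrained weights for the original latent channels and zero-init
# the new condition channels, so the widened model initially matches the
# pretrained one (the paper does not specify its initialization).
with torch.no_grad():
    new.weight.zero_()
    new.weight[:, : old.in_channels] = old.weight
    new.bias.copy_(old.bias)

unet.conv_in = new
```

Zero-initializing the added input channels is a common choice for condition adapters because training then starts from the unmodified pretrained behavior; other initializations would also be consistent with the paper's description.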
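
The Experiment Setup row lists concrete hyperparameters (AdamW at 1e-4, 10% condition dropout, cosine/linear noise schedules with 1000 timesteps, DDIM with 20 steps, guidance scale w = 2.0). The sketch below transcribes those numbers into PyTorch/diffusers objects around a toy stand-in network; `TinyEpsNet`, the latent shapes, and the zeroed null condition are placeholders for illustration, not the released PCDMs code.

```python
import torch
from diffusers import DDIMScheduler, DDPMScheduler

class TinyEpsNet(torch.nn.Module):
    """Stand-in noise predictor. The real models are a 20-block,
    width-2,048 transformer prior and modified SD v2.1 UNets."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(4, 4, 3, padding=1)

    def forward(self, x, t, cond):  # t is ignored in this stub
        return self.conv(x + cond)

model = TinyEpsNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fixed LR, all stages

# Training noise schedules reported in the paper: cosine for the prior model,
# linear for the inpainting/refining models, both with 1000 timesteps.
prior_noise = DDPMScheduler(num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2")
unet_noise = DDPMScheduler(num_train_timesteps=1000, beta_schedule="linear")

# Classifier-free guidance training: drop condition c with probability 10%.
def maybe_drop(cond, p=0.1):
    return torch.zeros_like(cond) if torch.rand(()).item() < p else cond

# Inference: DDIM sampler with 20 steps, guidance scale w = 2.0 on all stages.
sampler = DDIMScheduler(num_train_timesteps=1000)
sampler.set_timesteps(20)
w = 2.0
x = torch.randn(1, 4, 32, 22)   # latent for a 256x176 image (VAE downscales by 8)
cond = torch.randn_like(x)      # placeholder condition
null = torch.zeros_like(cond)   # ASSUMPTION: zero tensor as the null condition
for t in sampler.timesteps:
    with torch.no_grad():
        eps_c = model(x, t, cond)
        eps_u = model(x, t, null)
    eps = eps_u + w * (eps_c - eps_u)
    x = sampler.step(eps, t, x).prev_sample
```

Note the guidance step uses the convention eps_u + w * (eps_c - eps_u), under which w = 2.0 is a mild guidance strength; the paper's exact parameterization of w could differ.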