Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models
Authors: Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, Wei Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios. |
| Researcher Affiliation | Industry | Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, Wei Yang (Tencent AI Lab) {ffeishen, huye, junejzhang, xvencewang, haroldhan, willyang}@tencent.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and model will be available at https://github.com/tencent-ailab/PCDMs. |
| Open Datasets | Yes | We carry out experiments on DeepFashion (Liu et al., 2016), which consists of 52,712 high-resolution images of fashion models, and Market-1501 (Zheng et al., 2015) including 32,668 low-resolution images with diverse backgrounds, viewpoints, and lighting conditions. |
| Dataset Splits | Yes | We follow the dataset splits provided by (Bhunia et al., 2023). Note that the person IDs of the training and testing sets do not overlap in either dataset. |
| Hardware Specification | Yes | We perform our experiments on 8 NVIDIA V100 GPUs. |
| Software Dependencies | Yes | For the prior conditional diffusion model, we employ OpenCLIP ViT-H/14 as the CLIP image encoder... For the inpainting and refining models, we use DINOv2-G/14 as the image encoder. We leverage the pretrained Stable Diffusion V2.1, modifying the first convolution layer to adapt additional conditions. (See the loading sketch below.) |
| Experiment Setup | Yes | Our configurations can be summarized as follows: (1) the transformer of the prior model has 20 transformer blocks with a width of 2,048. For the inpainting model and refining model, we use the pretrained Stable Diffusion V2.1 and modify the first convolution layer to adapt additional conditions. (2) We employ the AdamW optimizer with a fixed learning rate of 1e-4 in all stages. (3) Following (Ren et al., 2022; Bhunia et al., 2023), we train our models using images of sizes 256×176 and 512×352 for the DeepFashion dataset. For the Market-1501 dataset, we utilize images of size 128×64. Please refer to Section B of the Appendix for more detail. (Supplementary Section B): We utilize the AdamW optimizer with a consistent learning rate of 1e-4 across all stages. The probability of random dropout for condition c is set at 10%. ... The model is trained for 100k iterations with a batch size of 256, using a cosine noising schedule (Nichol & Dhariwal, 2021) with 1000 timesteps. For the inpainting and refining models... These models are trained for 200k and 100k iterations, respectively, each with a batch size of 128, and a linear noise schedule with 1000 timesteps is applied. In the inference stage, we use the DDIM sampler with 20 steps and set the guidance scale w to 2.0 for PCDMs on all stages. (See the configuration sketch below.) |
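
The "Software Dependencies" row names three pretrained backbones plus a modified first convolution. Below is a minimal loading sketch, assuming the `open_clip_torch`, `torch`, and `diffusers` packages; the `extra_channels` value is an illustrative assumption, not a number taken from the paper.

```python
import torch
import open_clip
from diffusers import UNet2DConditionModel

# CLIP image encoder for the prior conditional diffusion model.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)

# DINOv2 image encoder for the inpainting and refining models.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")

# Pretrained Stable Diffusion 2.1 UNet backbone.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# Widen the first convolution so the UNet accepts extra condition channels.
# `extra_channels` is hypothetical; the paper does not state the count here.
extra_channels = 5
old_conv = unet.conv_in
new_conv = torch.nn.Conv2d(
    old_conv.in_channels + extra_channels,
    old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
)
with torch.no_grad():
    # Keep pretrained behavior: copy old weights, zero-init the new channels.
    new_conv.weight.zero_()
    new_conv.weight[:, : old_conv.in_channels] = old_conv.weight
    new_conv.bias.copy_(old_conv.bias)
unet.conv_in = new_conv
unet.register_to_config(in_channels=new_conv.in_channels)
```

Zero-initializing the added input channels keeps the modified UNet's initial outputs identical to the pretrained model, which is the usual way extra condition channels are grafted onto a pretrained diffusion backbone.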
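
The "Experiment Setup" row maps onto standard `diffusers` components. The sketch below wires the quoted hyperparameters (AdamW at 1e-4, 10% condition dropout, cosine/linear schedules with 1000 timesteps, DDIM with 20 steps, guidance scale w = 2.0) into runnable form. The `unet` is the model from the previous sketch; `maybe_drop_condition` and the guidance combination are assumed, conventional formulations, not code from the paper.

```python
import torch
from diffusers import DDPMScheduler, DDIMScheduler

# `unet` is the modified model from the previous sketch.
# AdamW with a fixed learning rate of 1e-4 in all stages.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

# Prior model: cosine noising schedule with 1000 timesteps
# ("squaredcos_cap_v2" is the diffusers name for its cosine schedule option).
prior_scheduler = DDPMScheduler(
    num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2"
)
# Inpainting/refining models: linear noise schedule with 1000 timesteps.
pixel_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule="linear")

# Randomly drop the condition c 10% of the time during training so that
# classifier-free guidance is possible at inference.
COND_DROP_PROB = 0.10

def maybe_drop_condition(cond: torch.Tensor) -> torch.Tensor:
    """Replace the condition with a null (zero) embedding with 10% probability."""
    if torch.rand(()) < COND_DROP_PROB:
        return torch.zeros_like(cond)
    return cond

# Inference: DDIM sampler with 20 steps, guidance scale w = 2.0.
ddim = DDIMScheduler(num_train_timesteps=1000, beta_schedule="linear")
ddim.set_timesteps(20)
GUIDANCE_SCALE = 2.0

@torch.no_grad()
def guided_noise_pred(latents, t, cond, null_cond):
    # Standard classifier-free guidance combination (assumed form):
    # eps = eps_uncond + w * (eps_cond - eps_uncond).
    eps_cond = unet(latents, t, encoder_hidden_states=cond).sample
    eps_uncond = unet(latents, t, encoder_hidden_states=null_cond).sample
    return eps_uncond + GUIDANCE_SCALE * (eps_cond - eps_uncond)
```

The 10% condition dropout during training is what makes the guidance scale meaningful at inference: it gives the model a valid unconditional mode to contrast against the conditional prediction.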