Cross-view Masked Diffusion Transformers for Person Image Synthesis
Authors: Trung X. Pham, Kang Zhang, Chang D. Yoo
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | X-MDPT demonstrates scalability, improving FID, SSIM, and LPIPS with larger models. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while exhibiting efficiency in terms of training parameters, training time, and inference speed. Our compact 33MB model achieves an FID of 7.42, surpassing a prior Unet latent diffusion approach (FID 8.07) with 11× fewer parameters. |
| Researcher Affiliation | Academia | Trung X. Pham*, Zhang Kang*, Chang D. Yoo, Korea Advanced Institute of Science and Technology (KAIST) |
| Pseudocode | No | The paper includes mathematical equations and descriptions of the model architecture, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/trungpx/xmdpt. |
| Open Datasets | Yes | We evaluate our method against the state-of-the-art (SOTA) using high-resolution images from the DeepFashion In-shop Clothes Retrieval Benchmark dataset (Liu et al., 2016) at resolutions of 256×256 and 512×512. |
| Dataset Splits | No | The paper states: 'The dataset comprises non-overlapping train and test subsets, containing 101,966 and 8,570 pairs, respectively.' It does not explicitly mention a validation split percentage or count. |
| Hardware Specification | Yes | For 256×256 images, training was conducted on a single A100 GPU (80GB RAM) with a batch size of 32, spanning 800k steps. Meanwhile, for 512×512 images, we employed two A100 GPUs with a batch size of 10 (5 images per GPU), trained for 1M steps. |
| Software Dependencies | No | The paper mentions using the 'pre-trained VAE ft-MSE of Stable Diffusion', 'OpenPose', 'OpenCV', and 'DINOv2' but does not provide specific version numbers for these software components or libraries; a hedged loading sketch follows the table. |
| Experiment Setup | Yes | For 256×256 images, training was conducted on a single A100 GPU (80GB RAM) with a batch size of 32, spanning 800k steps. Meanwhile, for 512×512 images, we employed two A100 GPUs with a batch size of 10 (5 images per GPU), trained for 1M steps. For ablations, we trained X-MDPT-B for 300k steps at 256×256 resolution. The learning rate was set to 1e-4, the model's EMA rate to 0.9999, and other settings aligned with DiT (Peebles & Xie, 2023). We use 50-step DDIM (Song et al., 2020) for inference; a sketch of these settings follows the table. |
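
Since the paper pins no versions, here is a minimal sketch of loading the pre-trained components it names. The Hugging Face model ID `stabilityai/sd-vae-ft-mse`, the `torch.hub` entry point, and the `dinov2_vitb14` variant are assumptions; the paper specifies neither versions nor the exact DINOv2 variant.

```python
# Hedged sketch (not the authors' code): loading the pre-trained pieces the
# paper names. Model IDs and the DINOv2 variant are assumptions.
import torch
from diffusers import AutoencoderKL

# Stable Diffusion's ft-MSE VAE (assumed Hugging Face ID).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

# DINOv2 backbone for conditioning features (variant assumed).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

with torch.no_grad():
    img = torch.randn(1, 3, 256, 256)                         # dummy 256x256 image
    latents = vae.encode(img).latent_dist.sample() * 0.18215  # SD latent scaling
    feats = dinov2(torch.randn(1, 3, 224, 224))               # global embedding
print(latents.shape, feats.shape)  # (1, 4, 32, 32), (1, 768)
```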
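The experiment-setup row also translates into a few concrete knobs. The sketch below assumes an AdamW optimizer (following DiT; the paper does not name one here) and uses the `diffusers` `DDIMScheduler` as a stand-in for the 50-step DDIM sampler; the tiny `Linear` model and placeholder loss are illustrative only.

```python
# Hedged sketch of the reported settings: lr 1e-4, parameter EMA with decay
# 0.9999, batch size 32 at 256x256, and 50-step DDIM at inference. The Linear
# layer stands in for the X-MDPT transformer; the loss is a placeholder for
# the diffusion objective.
import copy
import torch
from diffusers import DDIMScheduler

model = torch.nn.Linear(16, 16)             # stand-in for the DiT backbone
ema_model = copy.deepcopy(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # optimizer assumed

def ema_update(ema, online, decay=0.9999):
    """Exponential moving average of parameters, as in DiT."""
    with torch.no_grad():
        for p_ema, p in zip(ema.parameters(), online.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# One training step; batch size 32 matches the reported 256x256 setting.
x = torch.randn(32, 16)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
ema_update(ema_model, model)

# Inference: a DDIM scheduler restricted to 50 sampling steps.
scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)
print(scheduler.timesteps.shape)  # torch.Size([50])
```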