Cross-view Masked Diffusion Transformers for Person Image Synthesis

Authors: Trung X. Pham, Kang Zhang, Chang D. Yoo

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | X-MDPT demonstrates scalability, improving FID, SSIM, and LPIPS with larger models. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while exhibiting efficiency in terms of training parameters, training time, and inference speed. Our compact 33MB model achieves an FID of 7.42, surpassing a prior Unet latent diffusion approach (FID 8.07) using only 11× fewer parameters.
Researcher Affiliation | Academia | Trung X. Pham, Zhang Kang, Chang D. Yoo, Korea Advanced Institute of Science and Technology (KAIST)
Pseudocode | No | The paper includes mathematical equations and descriptions of the model architecture, but no structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/trungpx/xmdpt.
Open Datasets | Yes | We evaluate our method against the state-of-the-art (SOTA) using high-resolution images from the DeepFashion In-shop Clothes Retrieval Benchmark dataset (Liu et al., 2016) at resolutions of 256×256 and 512×512.
Dataset Splits | No | The paper states: 'The dataset comprises non-overlapping train and test subsets, containing 101,966 and 8,570 pairs, respectively.' It does not explicitly mention a validation split percentage or count.
Hardware Specification | Yes | For 256×256 images, training was conducted on a single A100 GPU (80GB RAM) with a batch size of 32, spanning 800k steps. Meanwhile, for 512×512 images, we employed two A100 GPUs with a batch size of 10 (5 images per GPU), trained for 1M steps.
Software Dependencies | No | The paper mentions using 'pre-trained VAE ft-MSE of Stable Diffusion', 'OpenPose', 'OpenCV', and 'DINOv2' but does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | For 256×256 images, training was conducted on a single A100 GPU (80GB RAM) with a batch size of 32, spanning 800k steps. Meanwhile, for 512×512 images, we employed two A100 GPUs with a batch size of 10 (5 images per GPU), trained for 1M steps. For ablations, we trained X-MDPT-B with 300k steps at a 256×256 resolution. The learning rate was set to 1e-4, the model's EMA rate to 0.9999, and other settings aligned with DiT (Peebles & Xie, 2023). We use 50 DDIM steps (Song et al., 2020) for inference.
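
As a reading aid, the hyperparameters quoted in the Experiment Setup row (learning rate 1e-4, EMA rate 0.9999, batch size 32, 50-step DDIM sampling) can be sketched as a single PyTorch training step. This is a minimal illustration only, not the authors' code: the placeholder network, dummy data, and helper names are assumptions, and the actual X-MDPT implementation lives in the repository linked above.

```python
import copy
import torch
import torch.nn as nn

# Hyperparameters quoted in the paper's experiment setup (256x256 case).
LR = 1e-4            # learning rate
EMA_RATE = 0.9999    # EMA decay for the model weights
BATCH_SIZE = 32      # per-step batch size on a single A100
DDIM_STEPS = 50      # DDIM sampling steps at inference; e.g. with the
                     # diffusers library this corresponds to
                     # DDIMScheduler(num_train_timesteps=1000).set_timesteps(DDIM_STEPS)

# Placeholder denoiser standing in for the X-MDPT transformer; the real
# model is at https://github.com/trungpx/xmdpt.
model = nn.Sequential(nn.Linear(4, 64), nn.GELU(), nn.Linear(64, 4))
ema_model = copy.deepcopy(model).requires_grad_(False)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

@torch.no_grad()
def ema_update(ema: nn.Module, online: nn.Module, rate: float = EMA_RATE) -> None:
    # w_ema <- rate * w_ema + (1 - rate) * w, the usual DiT-style weight EMA.
    for p_ema, p in zip(ema.parameters(), online.parameters()):
        p_ema.mul_(rate).add_(p, alpha=1.0 - rate)

# One illustrative optimization step on dummy latents.
x = torch.randn(BATCH_SIZE, 4)
loss = (model(x) - x).pow(2).mean()   # stand-in for the diffusion loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
ema_update(ema_model, model)
```

Following the DiT recipe that the paper says it aligns with, evaluation would use the EMA weights (ema_model) rather than the raw online weights.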