MC-DiT: Contextual Enhancement via Clean-to-Clean Reconstruction for Masked Diffusion Models
Authors: Guanghao Zheng, Yuchen Liu, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on 256×256 and 512×512 image generation on the ImageNet dataset demonstrate that the proposed MC-DiT achieves state-of-the-art performance in unconditional and conditional image generation with enhanced convergence speed. |
| Researcher Affiliation | Academia | Guanghao Zheng, Yuchen Liu, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong — School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University |
| Pseudocode | No | The paper describes the proposed method using text and mathematical equations, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Justification: We have provided the core file of our code in the supplementary material, and the code will be released upon acceptance. |
| Open Datasets | Yes | We train MC-DiT on ImageNet [39] with resolutions 256×256×3 and 512×512×3, respectively. |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly detail a validation dataset split or how it was used. |
| Hardware Specification | Yes | Table 8 lists the GPUs used: 2× RTX-3090 GPUs and 4× V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like the "AdamW optimizer", "pretrained variational autoencoder (VAE) from Stable Diffusion [37]", and the "EDM [21] framework", but does not specify version numbers for these or other libraries/frameworks. |
| Experiment Setup | Yes | Most training settings are the same as MaskDiT [48]. We train MC-DiT for 400K to 1M iterations using the AdamW optimizer with learning rate 0.0001 and no weight decay. By default, we use a 50% mask ratio and batch size 1024. λ1 and λ2 in Eq. (12) are set to 0.1 and 0.05 for more denoising reconstruction. The EMA coefficient is set to 0.999 for smoothness, and no data augmentation is employed. |
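The reported training settings can be collected into a single configuration for reference. The sketch below is a minimal, hypothetical illustration: the hyperparameter values are quoted from the paper's setup, but the `config` dict and the `ema_update` helper are illustrative, not the authors' released code.

```python
# Hyperparameters as reported in the paper's experiment setup.
# The structure below is a hypothetical sketch, not the authors' code.
config = {
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 0.0,
    "mask_ratio": 0.5,                      # default 50% mask ratio
    "batch_size": 1024,
    "iterations": (400_000, 1_000_000),     # 400K to 1M, model-dependent
    "lambda_1": 0.1,                        # loss weights in Eq. (12)
    "lambda_2": 0.05,
    "ema_coefficient": 0.999,
    "data_augmentation": False,
}

def ema_update(ema_value, new_value, decay=config["ema_coefficient"]):
    """One step of the exponential moving average used for weight smoothing:
    ema <- decay * ema + (1 - decay) * new."""
    return decay * ema_value + (1 - decay) * new_value
```

With a decay of 0.999, each update moves the smoothed weights only 0.1% of the way toward the current weights, which is what keeps the EMA trajectory stable across the 400K to 1M training iterations.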