Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Simplified and Generalized Masked Diffusion for Discrete Data
Authors: Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis Titsias
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On GPT-2 scale text modeling and pixel-level image modeling tasks, masked diffusions trained using our simple ELBO objective outperform previous proposals, leading to the best likelihood and zero-shot transfer performance among discrete diffusion models. 7 Experiments |
| Researcher Affiliation | Industry | Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias. Google DeepMind. Correspondence to: EMAIL. |
| Pseudocode | Yes | A single step of MD4 training algorithm is described in Alg. 1 in Appendix. A complete description of the sampling algorithm can be found in Alg. 2 in Appendix. |
| Open Source Code | Yes | Our code is available at https://github.com/google-deepmind/md4. |
| Open Datasets | Yes | text8 [55], a character-level text modeling benchmark, and OpenWebText [56], an open clone of the unreleased WebText dataset used to train GPT-2 [57]. train MD4 on order-agnostic image data from CIFAR-10 and downsampled ImageNet 64×64 [63]. |
| Dataset Splits | Yes | We kept 2% of the original training set for validation. |
| Hardware Specification | Yes | Our model is trained on 16 TPU-v5 lite for less than a day. Our CIFAR-10 model is trained on 32 TPU-v5 lite for 24 hours. Our ImageNet-64×64 model is trained on 256 TPU-v5 lite for 3.5 days. |
| Software Dependencies | No | The paper mentions 'JAX [45] implementation of categorical sampling' but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | We used a cosine learning rate schedule with a linear warm up of 2000 steps. We applied channel-wise dropout of rate 0.05 and used AdamW optimizer with learning rate 0.0003 and a weight decay factor of 0.03. We kept the training hyperparameters the same as text8 experiment except that we reduced the dropout rate to 0.02. We used AdamW optimizer and trained for 2M iterations. We used learning rate 0.0004, batch size 256, weight decay factor 0.01 for CIFAR-10 and learning rate 0.0002, batch size 512, weight decay factor 0.03 for ImageNet 64×64. |
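
The learning rate schedule quoted under Experiment Setup (linear warm up for 2000 steps, then cosine decay, with a peak learning rate of 0.0003) can be illustrated with a minimal sketch. This is not taken from the authors' code: the function name `lr_at_step`, the total step count `total_steps`, and the final learning rate `end_lr` are assumptions made for illustration, since the quoted text does not state the training length for the text experiments or the value the schedule decays to.

```python
import math


def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000,
               total_steps=1_000_000, end_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to end_lr.

    peak_lr and warmup_steps follow the quoted setup; total_steps and
    end_lr are hypothetical values chosen only for this sketch.
    """
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_lr + (peak_lr - end_lr) * cosine


if __name__ == "__main__":
    for s in (0, 1000, 2000, 500_000, 1_000_000):
        print(s, lr_at_step(s))
```

Under these assumptions the rate rises from 0 to 0.0003 over the first 2000 steps and then falls smoothly toward `end_lr`. The image settings quoted above (learning rate 0.0004 or 0.0002, 2M iterations) could be tried by passing different `peak_lr` and `total_steps` values, although the quoted text does not say whether the image experiments use the same schedule shape.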