Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Generalized Interpolating Discrete Diffusion
Authors: Dimitri Von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the practical side, in Sections 4 and 5 we apply our theory to the special case of masking noise in combination with varying levels of uniform noise. We conduct an ablation study, showing that our mask-only model achieves compute-matched state-of-the-art on diffusion language modeling thanks to a reweighted training objective (Sec. 5.2). We also show that the addition of uniform noise leads to improved sample quality and unlocks self-correction abilities (Fig. 1, Tab. 1) that allows the model to iteratively improve samples beyond what is possible by simply traversing the backward diffusion process (Sec. 5.4). |
| Researcher Affiliation | Academia | 1 Data Analytics Lab, Department of Computer Science, ETH Zurich; 2 ELLIS Institute Tübingen; 3 Max Planck Institute for Intelligent Systems, Tübingen. Correspondence to: Dimitri von Rütte <EMAIL>. |
| Pseudocode | Yes | A pseudocode implementation is given in Algorithm 1. |
| Open Source Code | Yes | Code: https://github.com/dvruette/gidd/ |
| Open Datasets | Yes | To this end, we adopt the Open Web Text (OWT) dataset (Gokaslan et al., 2019) since there exists a rich literature for both autoregressive and diffusion models trained on this dataset. |
| Dataset Splits | Yes | For computing validation metrics, we reserve the last 100k samples (~1.25%) of the training set (Open Web Text). Validation samples that are longer than the context length are cropped to a random window for consistency with training. For sequences longer than 512 tokens we select a random window of 512 tokens, while short sequences are padded to a length of 512. |
| Hardware Specification | Yes | All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens) on a single node of 8 NVIDIA A100/H100-80GB GPUs in bfloat16 precision using PyTorch's mixed precision training (torch.cuda.autocast). |
| Software Dependencies | No | All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens) on a single node of 8 NVIDIA A100/H100-80GB GPUs in bfloat16 precision using PyTorch's mixed precision training (torch.cuda.autocast). For optimization, we use the Adam optimizer (Kingma & Ba, 2017). |
| Experiment Setup | Yes | All our models are based on the DiT architecture (Peebles & Xie, 2023) and use the GPT2 tokenizer (Radford et al., 2019). We train models of three different sizes: TINY (L = 6, H = 8, d = 512; 28.4M non-emb. params.), SMALL (L = 12, H = 12, d = 768; 92.1M non-emb. params.), and BASE (L = 24, H = 16, d = 1024; 321.2M non-emb. params.), where L denotes the number of layers, H the number of attention heads, and d the dimensionality of hidden states. All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens)... For optimization, we use the Adam optimizer (Kingma & Ba, 2017) with β = (0.9, 0.99), ϵ = 10⁻⁹, and a learning rate of 5 × 10⁻⁴. The learning rate is warmed up linearly for the first 10k steps and then decayed using a cosine schedule to 10% of the initial learning rate. We use weight decay 0.0 for our ablations (unless stated otherwise) and 0.02 for the final configuration, also referred to as GIDD+. We also use gradient clipping to a norm of 1.0. |
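
The learning-rate schedule and token budget quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reading of the quoted text, not the authors' implementation. The function names (`learning_rate`, `total_training_tokens`) are hypothetical. Two details are assumptions: the warmup starts from zero, and the cosine decay spans the remaining 490k steps.

```python
import math

# Values taken from the quoted Experiment Setup text.
WARMUP_STEPS = 10_000
TOTAL_STEPS = 500_000
PEAK_LR = 5e-4
FINAL_LR_FRACTION = 0.1  # decay to 10% of the initial learning rate


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay (hypothetical helper).

    Assumptions: warmup starts at 0, and the cosine phase covers
    every step after warmup up to TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    floor = PEAK_LR * FINAL_LR_FRACTION
    return floor + (PEAK_LR - floor) * cosine


def total_training_tokens(context: int = 512, batch: int = 512,
                          steps: int = TOTAL_STEPS) -> int:
    """Tokens seen during training: context size x batch size x steps."""
    return context * batch * steps


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 255_000, 500_000):
        print(step, learning_rate(step))
    print(total_training_tokens())
```

The token count works out to 512 × 512 × 500,000 = 131,072,000,000, which matches the "131B training tokens" figure in the table.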