Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Anchored Diffusion Language Model

Authors: Litu Rout, Constantine Caramanis, Sanjay Shakkottai

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental ADLM significantly improves test perplexity on LM1B and Open Web Text, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches.
Researcher Affiliation Academia Litu Rout, Constantine Caramanis, Sanjay Shakkottai; The University of Texas at Austin; EMAIL
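Pseudocode Yes Algorithm 1: Anchored Diffusion Language Model (ADLM)
Input: Anchor network $y_\phi(\cdot)$, denoising network $x_\psi(\cdot)$, number of steps $T$, noise schedule $\alpha_t$, remasking schedule $\sigma_t$
Output: Generated sequence $z_0$
1 Initialize $z_T \leftarrow (m, m, \ldots, m)$ (fully masked sequence)
2 for $i = T$ to $1$ do
3 $t = i/T$, $s = (i-1)/T$
4 Compute noise schedule: $\alpha_t$, $\alpha_s$
5 Compute remasking schedule: $\sigma_t \in [0, \sigma_t^{\max}]$ (follows Eq. (9) of Wang et al., 2025a)
6 Compute anchor transition using noisy sequence (Eq. (5) in §3):
$$r(y_s^l \mid z_t^l, y_\phi(z_t)) = \begin{cases} \mathrm{Cat}\big(y_s^l;\ (1-\sigma_t)\, y^l + \sigma_t\, m\big), & z_t^l \neq m \\ \mathrm{Cat}\Big(y_s^l;\ \frac{\alpha_s - (1-\sigma_t)\alpha_t}{1-\alpha_t}\, y_\phi^l(z_t) + \frac{1 - \alpha_s - \alpha_t \sigma_t}{1-\alpha_t}\, m\Big), & z_t^l = m \end{cases}$$
7 Compute inference posterior using anchored prediction $y_\phi(z_t)$ (Eq. (6) in §3):
$$q(z_s^l \mid z_t^l, x_\psi^l(y_\phi(z_t))) = \begin{cases} \mathrm{Cat}\big(z_s^l;\ (1-\sigma_t)\, x^l + \sigma_t\, m\big), & z_t^l \neq m \\ \mathrm{Cat}\Big(z_s^l;\ \frac{\alpha_s - (1-\sigma_t)\alpha_t}{1-\alpha_t}\, x_\psi^l(y_\phi(z_t)) + \frac{1 - \alpha_s - \alpha_t \sigma_t}{1-\alpha_t}\, m\Big), & z_t^l = m \end{cases}$$
8 Sample $z_s^l \sim q(z_s^l \mid z_t^l, x_\psi^l(y_\phi(z_t)))$ for all $l \in \{1, \ldots, L\}$
9 Update $z_t \leftarrow z_s$
10 end
11 return $z_0$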
Open Source Code Yes Please see our project page: https://anchored-diffusion-llm.github.io/ for code and demo.
Open Datasets Yes We evaluate ADLM on two benchmarks: One Billion Words (LM1B) (Chelba et al., 2013) and Open Web Text (OWT) (Gokaslan & Cohen, 2019)... One Billion Words (LM1B). We use the LM1B dataset (Chelba et al., 2013), which consists of news crawl data collected by Chelba et al. (2013). The dataset is released under the Apache 2.0 license. Open Web Text (OWT). We use the Open Web Text dataset (Gokaslan & Cohen, 2019), which is a public reproduction of the Web Text dataset originally used in GPT-2 (Radford et al., 2019). It consists of web content extracted from high-quality Reddit URLs. The dataset is licensed under the Creative Commons CC0 license ("no rights reserved").
Dataset Splits Yes For LM1B, we use a context length of 128 with the BERT-base-uncased tokenizer and evaluate on the standard test split... As OWT dataset does not provide an official train/test split, we use the splits used in prior work (Sahoo et al., 2024) and train ADLM for 1M and 2M steps with a GPT-2 tokenizer, batch size of 512, sequence length of 1024, and a log-linear diffusion schedule.
Hardware Specification No This research has been supported by NSF Grants 2019844 and 2112471, the UT Austin Machine Learning Lab, and computing support on the Vista GPU Cluster through the Center for Generative AI (CGAI) and the Texas Advanced Computing Center (TACC) at UT Austin.
Software Dependencies No For OWT, we use the GPT-2 tokenizer (Radford et al., 2019). Our anchor network adopts the transformer architecture from SEDD (Lou et al., 2024), based on the Diffusion Transformer (DiT) (Peebles & Xie, 2023) with rotary positional embeddings (Su et al., 2024). The denoiser network uses the same base architecture but with half the number of transformer layers... We use the same AdamW optimizer for both anchor and denoising transformers with learning rate = 3e-4 and no weight decay.
Experiment Setup Yes For LM1B, we use a context length of 128 with the BERT-base-uncased tokenizer and evaluate on the standard test split. For OWT, we use the GPT-2 tokenizer (Radford et al., 2019). Our anchor network adopts the transformer architecture from SEDD (Lou et al., 2024), based on the Diffusion Transformer (DiT) (Peebles & Xie, 2023) with rotary positional embeddings (Su et al., 2024). The denoiser network uses the same base architecture but with half the number of transformer layers... The anchor transformer network uses 12 DiT blocks. The denoising network uses 6 DiT blocks. Each block has hidden dimension = 768 and 12 attention heads. The input sequence length is 1024 for OWT and 128 for LM1B. We use the same AdamW optimizer for both anchor and denoising transformers with learning rate = 3e-4 and no weight decay... train ADLM for 1M and 2M steps with a GPT-2 tokenizer, batch size of 512, sequence length of 1024, and a log-linear diffusion schedule... We use γ = 3e-3 and τ = 5 as our default configuration across all experiments.
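
For readers who want to trace the sampling loop quoted in the Pseudocode row, the following is a minimal plain-Python sketch of the token update (steps 7 to 9 of Algorithm 1). It is not the authors' implementation, and several pieces are assumptions made only for illustration: the mask id `MASK`, the stand-in networks `toy_anchor` and `toy_denoiser`, the schedule α_t = 1 − t, and the rule σ_t = min(η, σ_t^max) with σ_t^max = min(1, (1 − α_s)/α_t) and a hypothetical constant `eta`. Unmasked positions are assumed to keep their current token unless they are remasked. The anchor transition of step 6 is not sampled here, since the returned sequence depends only on the posterior q.

```python
import random

MASK = -1  # hypothetical id for the mask token m


def toy_anchor(z_t, vocab_size):
    """Stand-in for y_phi: one uniform distribution over the vocabulary per position."""
    return [[1.0 / vocab_size] * vocab_size for _ in z_t]


def toy_denoiser(anchor_probs):
    """Stand-in for x_psi: passes the anchored prediction through unchanged."""
    return anchor_probs


def sample_adlm(length, vocab_size, num_steps, eta=0.1, seed=0):
    rng = random.Random(seed)
    z = [MASK] * length  # z_T: fully masked sequence
    for i in range(num_steps, 0, -1):
        t, s = i / num_steps, (i - 1) / num_steps
        alpha_t, alpha_s = 1.0 - t, 1.0 - s  # assumed schedule alpha_t = 1 - t
        # Cap sigma_t so the mask weight computed below stays non-negative.
        sigma_max = min(1.0, (1.0 - alpha_s) / alpha_t) if alpha_t > 0 else 1.0
        sigma_t = min(eta, sigma_max)
        # Anchored prediction: the denoiser consumes the anchor network's output.
        x_probs = toy_denoiser(toy_anchor(z, vocab_size))
        z_next = []
        for l, token in enumerate(z):
            if token != MASK:
                # Unmasked position: remask with probability sigma_t, else keep.
                z_next.append(MASK if rng.random() < sigma_t else token)
            else:
                # Masked position: stay masked with probability w_mask, else decode.
                w_mask = max(0.0, (1.0 - alpha_s - alpha_t * sigma_t) / (1.0 - alpha_t))
                if rng.random() < w_mask:
                    z_next.append(MASK)
                else:
                    z_next.append(rng.choices(range(vocab_size), weights=x_probs[l])[0])
        z = z_next  # update z_t <- z_s
    return z
```

Under these assumptions the last step has s = 0, so α_s = 1 and σ_t^max = 0: no position is remasked and every remaining mask is decoded, which is why the returned sequence contains no mask tokens. For example, `sample_adlm(length=8, vocab_size=5, num_steps=10)` returns a list of 8 token ids in the range 0 to 4.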