Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Non-Markovian Discrete Diffusion with Causal Language Models

Authors: Yangtian Zhang, Sizhuang He, Daniel S Levine, Lawrence Zhao, David Zhang, Syed Rizvi, Shiyang Zhang, Emanuele Zappala, Rex Ying, David van Dijk

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers. [...] Quantitative results show that CaDDi outperforms recent discrete diffusion models, achieving lower generative perplexity on language datasets and stronger reasoning capabilities when leveraging a pretrained LLM. [...] We compare CaDDi against several well-established discrete diffusion models: D3PM [2], SEDD [37], MDLM [44], UDLM [46], and Discrete Flow Matching [21]. [...] As shown in Table 1, CaDDi-AR and CaDDi consistently outperform baselines in generative perplexity across the three language model oracles. [...] To assess the impact of key architectural and training design choices, we conduct an ablation study on a subset of the Text8 dataset...
Researcher Affiliation | Academia | Yale University, New Haven, CT, USA
Pseudocode | Yes | Algorithm 1: Inference for Non-Markovian Discrete Diffusion
Open Source Code | No | Answer: [No] Justification: The paper doesn't release its code at this time but great details on experiment setups are provided and will release its code at an appropriate time.
Open Datasets | Yes | One Billion Words Dataset. We evaluate CaDDi's generative capabilities on the One Billion Words dataset (LM1B) [7]... Text8 Dataset. Following prior work [2, 48], we trained a discrete diffusion model on short text chunks of length 256 from the text8 dataset... General Reasoning Datasets with Fine-tuned LLM. ...ARC-Challenge, ARC-Easy [14], BoolQ [13], PIQA [4], RACE [33], Social IQA [45], and LAMBADA [40]... Amazon Polarity dataset [38]
Dataset Splits | Yes | Text8 Dataset. Following prior work [2, 48], we trained a discrete diffusion model on short text chunks of length 256 from the text8 dataset. We use the same dataset split as in previous studies, training on the training set and reporting performance on the test set using the standard bits-per-dimension (BPD) metric. [...] We use the first 90% of the dataset for training and the remaining 10% for validation and testing.
Hardware Specification | Yes | Our models are trained on 4 NVIDIA H100 GPUs with mixed precision.
Software Dependencies | No | Dataset Preprocessing. We follow the preprocessing setup introduced in DiffusionBERT [26], using the One Billion Word Benchmark [7]. Sentences are tokenized using the bert-base-uncased tokenizer with a vocabulary size of 30,522. All models are based on a 12-layer Transformer decoder architecture... Models are trained using AdamW with a learning rate of 3e-4, 2500 warm-up steps. We evaluate generative perplexity using pretrained oracle models (GPT2, LLaMA-2-7B, and LLaMA-3-3B). [...] We measure sentiment accuracy (SA) using a fine-tuned DistilBERT classifier.
Experiment Setup | Yes | Model Configuration. All models are based on a 12-layer Transformer decoder architecture with a hidden size of 768 and 12 attention heads. For D3PM [2], MDLM [44], and SEDD [37], we adopt an absorbing diffusion kernel with a log-linear noise schedule. For CaDDi and CaDDi-AR, we use the absorbing-state forward kernel described in Section A.5, with total diffusion steps set to T = 64. [...] CaDDi uses a context window of 5 and applies latent truncation as described in Section C.1.1. Training Details. Models are trained using AdamW with a learning rate of 3e-4, 2500 warm-up steps. All models use a batch size of 512 and train for 1000K steps. [...] Fine-tuning Setup. [...] a learning rate of 5e-5, a batch size of 64 with gradient accumulation, and 20K total steps. We adopt the absorbing kernel formulation with simplified ELBO as the training loss.
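
The Text8 split quoted under Dataset Splits (first 90% for training, remaining 10% for validation and testing, chunks of length 256) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `split_text8` and `chunk_text` are hypothetical, and dividing the held-out 10% evenly between validation and test is an assumption, since the quoted text does not say how that portion is divided.

```python
from typing import List, Tuple


def split_text8(text: str, train_percent: int = 90) -> Tuple[str, str, str]:
    """Split a character corpus into train / validation / test parts.

    The first `train_percent` percent is used for training, as quoted above.
    ASSUMPTION: the held-out remainder is halved into validation and test.
    """
    n_train = len(text) * train_percent // 100  # integer arithmetic, no float rounding
    held_out = text[n_train:]
    n_valid = len(held_out) // 2  # assumed even split, not stated in the source
    return text[:n_train], held_out[:n_valid], held_out[n_valid:]


def chunk_text(text: str, length: int = 256) -> List[str]:
    """Cut text into non-overlapping chunks of `length` characters.

    A trailing remainder shorter than `length` is dropped.
    """
    return [text[i:i + length] for i in range(0, len(text) - length + 1, length)]


if __name__ == "__main__":
    # Stand-in corpus; the real text8 file is not bundled here.
    corpus = "ab" * 500
    train, valid, test = split_text8(corpus)
    print(len(train), len(valid), len(test), len(chunk_text(train)))
```

Integer arithmetic is used for the split index so that the boundary does not depend on floating-point rounding.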