Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Authors: Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines. (Section 4: Experiments) |
| Researcher Affiliation | Collaboration | ¹Massachusetts Institute of Technology; ²Microsoft Research; ³Mila Quebec AI Institute |
| Pseudocode | Yes | Algorithm 1: Semantic Clustering with Size Constraints (Part 1: Setup and initialization); Algorithm 2: Semantic Clustering with Size Constraints (Part 2: Update and Size Constraints) |
| Open Source Code | Yes | Code is available at https://github.com/zhouc20/HDLM. |
| Open Datasets | Yes | In our experiments, we focus on language modeling, training the proposed HDLM algorithm on the widely used Open Web Text (OWT) [8] dataset following [34]. ... We additionally pretrain our model on LM1B [5], a smaller dataset that is popular for text generation. |
| Dataset Splits | Yes | We take 100000 data out of 8013769 data as the validation set. Following [34], we use a context length of 512 tokens and do not use sentence packing. |
| Hardware Specification | Yes | We utilize 8 NVIDIA A100/H100 80GB GPUs and employ mixed precision training in bf16 format. |
| Software Dependencies | No | The paper mentions "GPT2 [25] tokenizer" and "Adam optimizer [16]" but does not provide specific version numbers for these or other software libraries/tools used for implementation, which is required for reproducible software dependency information. |
| Experiment Setup | Yes | Following the setting of [34], all models are trained with a context size of 512 tokens and a batch size of 512 for 500k steps, resulting in a total of 131B training tokens. We utilize 8 NVIDIA A100/H100 80GB GPUs and employ mixed precision training in bf16 format. For optimization, we use the Adam optimizer [16] (β = (0.9, 0.99), ϵ = 10⁻⁹) with a learning rate of 5 × 10⁻⁴. The learning rate is warmed up linearly for the first 10k steps and then decayed using a cosine schedule to 10% of the initial learning rate. We apply a weight decay of 0.02 and gradient clipping to a norm of 1.0. We clip the largest value of loss weights w_{t,m}, w_{t,c} to 2.0 or 10.0 in training for stable optimization, yet do not clip while evaluating the ELBO for fair comparison. |
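
The learning-rate schedule and token budget quoted in the Experiment Setup row can be checked with a short sketch. This is not taken from the HDLM repository: the helper names `learning_rate` and `total_training_tokens` are hypothetical, the warmup is assumed to start from zero, and the cosine decay is assumed to span the steps remaining after warmup. Only the numeric constants come from the table.

```python
import math

# Constants quoted in the Experiment Setup row.
PEAK_LR = 5e-4          # peak learning rate
WARMUP_STEPS = 10_000   # linear warmup length
TOTAL_STEPS = 500_000   # total training steps
FINAL_FRACTION = 0.1    # cosine decay ends at 10% of the peak rate


def learning_rate(step: int) -> float:
    """Hypothetical helper: learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup; starting from zero is an assumption.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay over the remaining steps (assumed span), clamped at the end.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (FINAL_FRACTION + (1.0 - FINAL_FRACTION) * cosine)


def total_training_tokens(context: int = 512, batch: int = 512,
                          steps: int = TOTAL_STEPS) -> int:
    """Tokens seen in training: context length x batch size x steps."""
    return context * batch * steps


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 255_000, 500_000):
        print(step, learning_rate(step))
    print(total_training_tokens())
```

Under these assumptions the token count is 512 × 512 × 500 000 = 131 072 000 000, which matches the "131B training tokens" figure in the table.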