Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

Authors: Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines. (Section 4: Experiments)
Researcher Affiliation	Collaboration	(1) Massachusetts Institute of Technology; (2) Microsoft Research; (3) Mila Quebec AI Institute
Pseudocode	Yes	Algorithm 1: Semantic Clustering with Size Constraints (Part 1: Setup and initialization) Algorithm 2: Semantic Clustering with Size Constraints (Part 2: Update and Size Constraints)
Open Source Code	Yes	Code is available at https://github.com/zhouc20/HDLM.
Open Datasets	Yes	In our experiments, we focus on language modeling, training the proposed HDLM algorithm on the widely used Open Web Text (OWT) [8] dataset following [34]. ... We additionally pretrain our model on LM1B [5], a smaller dataset that is popular for text generation.
Dataset Splits	Yes	We take 100000 data out of 8013769 data as the validation set. Following [34], we use a context length of 512 tokens and do not use sentence packing.
Hardware Specification	Yes	We utilize 8 NVIDIA A100/H100 80GB GPUs and employ mixed precision training in bf16 format.
Software Dependencies	No	The paper mentions "GPT2 [25] tokenizer" and "Adam optimizer [16]" but does not provide specific version numbers for these or other software libraries/tools used for implementation, which is required for reproducible software dependency information.
Experiment Setup	Yes	Following the setting of [34], all models are trained with a context size of 512 tokens and a batch size of 512 for 500k steps, resulting in a total of 131B training tokens. We utilize 8 NVIDIA A100/H100 80GB GPUs and employ mixed precision training in bf16 format. For optimization, we use the Adam optimizer [16] (β = (0.9, 0.99), ϵ = 10⁻⁹) with a learning rate of 5 × 10⁻⁴. The learning rate is warmed up linearly for the first 10k steps and then decayed using a cosine schedule to 10% of the initial learning rate. We apply a weight decay of 0.02 and gradient clipping to a norm of 1.0. We clip the largest value of loss weights w_{t,m}, w_{t,c} to 2.0 or 10.0 in training for stable optimization, yet do not clip while evaluating the ELBO for fair comparison.
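
The figures in the Experiment Setup row can be sanity-checked with a few lines of Python. The sketch below is not taken from the HDLM repository. It recomputes the training-token budget from the quoted context size, batch size, and step count, and shows one plausible reading of the learning-rate schedule. The function and constant names are hypothetical. The row does not state where warmup begins or over which steps the cosine decay runs, so the sketch assumes warmup starts at zero and the decay spans all steps after warmup.

```python
import math

# Values quoted in the Experiment Setup row.
CONTEXT_LEN = 512        # tokens per sequence
BATCH_SIZE = 512         # sequences per step
TOTAL_STEPS = 500_000
WARMUP_STEPS = 10_000
PEAK_LR = 5e-4
FINAL_LR_FRACTION = 0.10  # cosine decay ends at 10% of the initial rate


def total_training_tokens() -> int:
    """Tokens seen in training: context size x batch size x steps."""
    return CONTEXT_LEN * BATCH_SIZE * TOTAL_STEPS


def learning_rate(step: int) -> float:
    """Linear warmup, then cosine decay to a floor.

    Assumptions (not stated in the row): warmup starts from 0, and the
    cosine phase covers every step from the end of warmup to TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold the floor past the final step
    floor = FINAL_LR_FRACTION * PEAK_LR
    return floor + 0.5 * (PEAK_LR - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"training tokens: {total_training_tokens():,}")
    for s in (0, 5_000, 10_000, 255_000, 500_000):
        print(f"step {s:>7,}: lr = {learning_rate(s):.6e}")
```

The product 512 × 512 × 500,000 is 131,072,000,000, which matches the "131B training tokens" quoted in the row. Under the stated assumptions the schedule reaches the peak rate at step 10k and ends at 5 × 10⁻⁵.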