Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Split Gibbs Discrete Diffusion Posterior Sampling

Authors: Wenda Chu, Zihui Wu, Yifan Chen, Yang Song, Yisong Yue

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct experiments to demonstrate the effectiveness of our algorithm on various posterior sampling tasks in discrete-state spaces, including discrete inverse problems and conditional generation guided by a reward function. [...] We validate SGDD on diverse inverse problems and reward-guided generation problems that involve discrete data. [...] For instance, we achieve a 42% higher median activity in enhancer DNA design, an 8.36 dB improvement of PSNR values in solving the XOR inverse problem on MNIST, and over 2× smaller Hellinger distance in music infilling, compared to existing methods.
Researcher Affiliation	Collaboration	Wenda Chu¹, Zihui Wu¹, Yifan Chen², Yang Song³, Yisong Yue¹; ¹California Institute of Technology, ²New York University, ³OpenAI
Pseudocode	Yes	Algorithm 1 Split Gibbs Discrete Diffusion Posterior Sampling (SGDD). Require: Concrete score model $s_\theta$, measurement $y$, noise schedule $\{\eta_k\}_{k=0}^{K-1}$. Initialize $x_0 \in \mathcal{X}^d$. for $k = 0, \dots, K-1$ do: Likelihood step following Eq. (13): $z^{(k)} \sim \pi(x = x^{(k)}, z; \eta_k)$. Prior step following Eq. (11): $x^{(k+1)} \sim \pi(x, z = z^{(k)}; \eta_k)$. end for. Return $x^{(K)}$.
Open Source Code	Yes	Data are all open-domain and the code is accessible in Supplementary files.
Open Datasets	Yes	We conduct experiments to demonstrate the effectiveness of our algorithm on various posterior sampling tasks in discrete-state spaces, including discrete inverse problems and conditional generation guided by a reward function. [...] We train a discrete diffusion model [29] on a publicly available dataset from Gosai et al. [15] that consists of 700k DNA sequences. [...] We evaluate on a discretized image domain. Specifically, we convert the MNIST dataset [25] to binary strings [...] We conduct experiments on monophonic songs from the Lakh pianoroll dataset [11].
Dataset Splits	No	For synthetic data, we use a closed-form concrete score function. For real datasets, we train a SEDD [29] model with the uniform transition kernel (details in Appendix C). [...] We use 1k binary images from the test set of MNIST and calculate the peak signal-to-noise ratio (PSNR) of the reconstructed image. [...] We run experiments on 100 samples in the test set and report the quantitative results in Table 5.
Hardware Specification	Yes	The runtime is amortized on a batch size of 10 samples with one NVIDIA A100 GPU.
Software Dependencies	No	We use the SEDD small architecture with around 90M parameters for all experiments, and the models are trained with AdamW [28] with batch size 32 and a learning rate of $3 \times 10^{-4}$.
Experiment Setup	Yes	SGDD implementation details. We implement our method with a total of K iterations. In each prior sampling step, we simulate the reverse continuous-time Markov chain with $H = 20$ steps. Additional details on hyperparameter choices are provided in Appendix C.3. [...] We use an annealing noise schedule of $\eta_k = \eta_{\min}^{k/K} \, \eta_{\max}^{1-k/K}$ with $\eta_{\min} = 10^{-4}$ and $\eta_{\max} = 20$. We run SGDD for K iterations. In each likelihood sampling step, we run Metropolis-Hastings for T steps, while in each prior sampling step, we run a few-step Euler discrete diffusion sampler with H steps. The hyperparameters used for each experiment are listed in Table 7.
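
As a rough illustration of the annealing schedule and outer loop quoted in the Pseudocode and Experiment Setup rows, the sketch below computes a geometric interpolation from η_max down toward η_min and alternates two sampling steps. It is not the authors' implementation: the function names `annealing_schedule` and `split_gibbs`, and the `likelihood_step` and `prior_step` callables, are hypothetical placeholders, and the schedule formula is read from the extracted text, so it may differ from the paper's exact form.

```python
from typing import Any, Callable, List, Sequence


def annealing_schedule(K: int, eta_min: float = 1e-4, eta_max: float = 20.0) -> List[float]:
    """Return eta_k = eta_min^(k/K) * eta_max^(1 - k/K) for k = 0, ..., K-1.

    Assumption: this geometric form is inferred from the quoted text.
    """
    return [eta_min ** (k / K) * eta_max ** (1 - k / K) for k in range(K)]


def split_gibbs(
    x0: Any,
    etas: Sequence[float],
    likelihood_step: Callable[[Any, float], Any],  # hypothetical: samples z given x
    prior_step: Callable[[Any, float], Any],       # hypothetical: samples x given z
) -> Any:
    """Alternate a likelihood step and a prior step once per noise level."""
    x = x0
    for eta in etas:
        z = likelihood_step(x, eta)  # stands in for the Metropolis-Hastings step
        x = prior_step(z, eta)       # stands in for the discrete diffusion sampler
    return x


if __name__ == "__main__":
    etas = annealing_schedule(K=10)
    print([f"{eta:.4g}" for eta in etas])
```

With these defaults the schedule starts at η_max = 20 and decreases monotonically; since k stops at K-1, the final value stays slightly above η_min.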