Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

Authors: Oussama Zekri, Nicolas Boullé

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform numerical experiments on DNA fine-tuning and natural language tasks to demonstrate the performance of our methods. We achieve state-of-the-art DNA results: Pred-Activity 7.64 and ATAC-Acc 99.9%, topping prior RL and guidance baselines with lower run-to-run variance. [...] 4 Experiments [...] 4.1 Language modeling [...] 4.2 DNA sequence modeling
Researcher Affiliation | Academia | Oussama Zekri, CREST, ENSAE, Institut Polytechnique de Paris, France; Nicolas Boullé, Department of Mathematics, Imperial College London, United Kingdom
Pseudocode | Yes | Algorithm 1 SEPO. 1: Require: CTMC Q^θ, iteration S, epoch K. 2: Set θ0 and θold to θpre. 3: for s ∈ [1, ..., S] do. 4: Sample from πθold with Q^θold. 5: Compute the reward and the advantage. 6: Optimize θs with ℓA for K epochs. 7: Set θold to θs. 8: end for. 9: Output: θS+1
Open Source Code | Yes | Our code is available at https://github.com/ozekri/SEPO.
Open Datasets | Yes | The reward model, built from GPT-2 architecture and trained on the HH-RLHF dataset (Bai et al., 2022), provides reward signals to guide fine-tuning (see Figs. 4 and 5). [...] We employ the pretrained model of Wang et al. (2025), a masked discrete diffusion model (Sahoo et al., 2024) trained on 700k regulatory DNA sequences (200 bp) from the Gosai dataset (Gosai et al., 2023). [...] We use the 153 prompts from the Awesome ChatGPT Prompts dataset (Akın, 2023).
Dataset Splits | Yes | We use half of the HH-RLHF dataset (Bai et al., 2022) to train the SFT model in an autoregressive fashion, and the other half to train the reward model, which has a logistic output R(x). [...] they split the data by chromosomes into two disjoint subsets, each covering half of the 23 human chromosomes. Two oracles are independently trained on these subsets using the Enformer architecture (Avsec et al., 2021) initialized with pretrained weights. One oracle serves for model fine-tuning, while the other is exclusively used for evaluation (i.e., Pred-Activity in Table 2).
Hardware Specification | Yes | On a single NVIDIA GeForce RTX 3090 (24GB) GPU, we report the following computational timings. [...] All experiments were run on an internal cluster on a single Nvidia RTX 3090 Ti GPU with 24GB of memory.
Software Dependencies | No | The paper mentions software like the Adam optimizer (Kingma, 2014) and architectures like GPT-2, but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch, CUDA, etc.) used in their implementation.
Experiment Setup | Yes | We fine-tune two versions of SEDD Medium, with a different number of denoising steps T to measure the impact on the quality of the fine-tuning. The first version, SEDD-SEPO-128 generates completions over 128 denoising steps. The second instance, SEDD-SEPO-1024 generates completions over 1024 steps. Both versions are trained for 7k steps on the HH-RLHF dataset. [...] we set ϵ = 0.2 in Eq. (9). [...] sequence generation is performed using 128 sampling steps for both SEPO and SEPO with gradient flow. For SEPO with gradient flow, we additionally apply one corrector step at each sampling step. In both cases, we include a KL regularization term, with α = 0.05 controlling its strength. Gradient truncation is applied at step 10 [...] We employ the Adam optimizer (Kingma, 2014) with a learning rate of 10^-4 and use a clipping ratio of ϵ = 0.2 in SEPO. The batch size and the number of output groups are both set to 8, and we use K = 2 in Algorithm 1. For computing each qy(θ), we draw M = 4 SNIS samples, following Section 3.
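
The outer loop of Algorithm 1 quoted in the Pseudocode row can be illustrated with a toy sketch. This is not the authors' implementation (see their repository for that) and it rests on several assumptions: a softmax policy over four tokens stands in for the CTMC sampler, a PPO-style clipped surrogate stands in for the loss ℓA, the advantage is the reward minus the batch mean, and plain gradient ascent replaces Adam. The function names (`softmax`, `clipped_surrogate_grad`, `sepo_like_loop`) and the toy reward are hypothetical. Only ϵ = 0.2, K = 2 and the batch size of 8 are taken from the Experiment Setup row; the step size of 0.5 is chosen for the toy problem and is not the reported learning rate of 10^-4.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_surrogate_grad(logits, old_probs, samples, advantages, eps):
    """Gradient w.r.t. the logits of the batch-mean clipped surrogate.

    Assumption: a PPO-style objective min(r * A, clip(r, 1-eps, 1+eps) * A)
    is used here as a stand-in for the paper's loss.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for x, adv in zip(samples, advantages):
        ratio = probs[x] / old_probs[x]
        # When the clipped branch is active it is constant in theta: no gradient.
        if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
            continue
        for j in range(len(logits)):
            indicator = 1.0 if j == x else 0.0
            grad[j] += adv * ratio * (indicator - probs[j]) / len(samples)
    return grad


def sepo_like_loop(logits_pre, reward_fn, iterations, epochs=2, eps=0.2,
                   batch_size=8, lr=0.5, seed=0):
    """Toy version of the outer loop: sample, score, optimize K epochs, sync."""
    rng = random.Random(seed)
    logits = list(logits_pre)       # step 2: theta_0 <- theta_pre
    old_logits = list(logits_pre)   # step 2: theta_old <- theta_pre
    for _ in range(iterations):     # step 3: for s in [1, ..., S]
        old_probs = softmax(old_logits)
        # Step 4: sample from the old policy (toy categorical sampler).
        samples = rng.choices(range(len(logits)), weights=old_probs, k=batch_size)
        # Step 5: reward and advantage (mean baseline is an assumption).
        rewards = [reward_fn(x) for x in samples]
        baseline = sum(rewards) / len(rewards)
        advantages = [r - baseline for r in rewards]
        # Step 6: optimize the current parameters for K epochs.
        for _ in range(epochs):
            grad = clipped_surrogate_grad(logits, old_probs, samples, advantages, eps)
            logits = [z + lr * g for z, g in zip(logits, grad)]
        old_logits = list(logits)   # step 7: theta_old <- theta_s
    return logits


if __name__ == "__main__":
    # Hypothetical reward: token 2 is the only rewarded outcome.
    final_logits = sepo_like_loop([0.0] * 4, lambda x: 1.0 if x == 2 else 0.0,
                                  iterations=200)
    print([round(p, 3) for p in softmax(final_logits)])
```

Because the advantages sum to zero within each batch, a batch in which every sample earns the same reward produces no update, which is why the toy policy settles once it almost always emits the rewarded token.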