Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

UFT: Unifying Supervised and Reinforcement Fine-Tuning

Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	3. Empirical Validation Across Model Scales and Tasks. We evaluate the algorithms by training Qwen2.5-0.5/1.5/3B [Qwen et al., 2025] and Llama3.2-1/3B [Grattafiori et al., 2024] on Countdown [Wikipedia contributors, 2025, Pan et al., 2025], MATH [Hendrycks et al., 2021], and the Knights and Knaves logic puzzle (Logic) [Xie et al., 2025]. UFT consistently outperforms previous methods, showing robustness across domains and models (cf. Section 5).
Researcher Affiliation	Academia	Mingyang Liu, Gabriele Farina & Asuman Ozdaglar; LIDS, EECS, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Pseudocode	Yes	The corresponding pseudocode is provided in Algorithm 1. The full algorithm can be found in Algorithm 2.
Open Source Code	Yes	The source code is available at https://github.com/liumy2010/UFT.
Open Datasets	Yes	We evaluate the algorithms by training Qwen2.5-0.5/1.5/3B [Qwen et al., 2025] and Llama3.2-1/3B [Grattafiori et al., 2024] on Countdown [Wikipedia contributors, 2025, Pan et al., 2025], MATH [Hendrycks et al., 2021], and the Knights and Knaves logic puzzle (Logic) [Xie et al., 2025].
Dataset Splits	No	The paper mentions 'Training Batch Size 256', 'Validation Batch Size 1312', and 'Mini-batch Size 64' in Table 1, and refers to 'test dataset' in general, but does not specify the initial training, validation, and test splits for the datasets used.
Hardware Specification	No	The project costs roughly 10,000 GPU hours. This indicates that GPUs were used, but no specific GPU models (e.g., NVIDIA A100, Tesla V100) or other hardware specifications are provided for reproducibility.
Software Dependencies	No	The experiment is based on VERL [Sheng et al., 2024] and Tiny Zero [Pan et al., 2025]. While these frameworks are mentioned, specific version numbers for them or any underlying software libraries (e.g., Python, PyTorch, CUDA versions) are not provided.
Experiment Setup	Yes	Table 1: The hyperparameters for training on different datasets. The table specifies 'Training Batch Size 256', 'Validation Batch Size 1312', 'Mini-batch Size 64', 'Hint Length 5', 'Learning Rate 10^-6', 'β 0.001', 'T 500', 'T_hint 300', 'Number of Rollouts 4', 'Context Window (Prompt) Countdown: 256, MATH(3,4,5): 1024, Logic: 1024', 'Context Window (Response) 1024', 'p_low 0.05', 'p_high 0.95', 'SFT Epochs 5', 'Accuracy Reward 1.0', 'Format Correctness Reward 0.1', 'Incorrect Reward 0.0'.
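
For readers who want to reuse the Experiment Setup values quoted above, the following is a minimal sketch that collects them in one Python object. It is not taken from the UFT repository: the class name `UFTSetup`, every field name, and the `reward` helper are hypothetical. Two further points are assumptions rather than facts from the table: the learning rate is read as 10^-6 (the extracted text shows "10 6"), and the three reward values are assumed to be mutually exclusive outcomes (correct answer, well-formatted but wrong answer, anything else).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UFTSetup:
    """Values quoted from Table 1 of the paper; field names are illustrative."""

    train_batch_size: int = 256
    val_batch_size: int = 1312
    mini_batch_size: int = 64
    hint_length: int = 5
    learning_rate: float = 1e-6  # extracted as "10 6"; assumed to mean 10^-6
    beta: float = 0.001
    t_total: int = 500  # "T" in Table 1; its exact meaning is not stated here
    t_hint: int = 300  # "T_hint" in Table 1
    num_rollouts: int = 4
    # Prompt context window per dataset; keys are hypothetical labels.
    prompt_window: dict = field(
        default_factory=lambda: {"countdown": 256, "math": 1024, "logic": 1024}
    )
    response_window: int = 1024
    p_low: float = 0.05
    p_high: float = 0.95
    sft_epochs: int = 5
    accuracy_reward: float = 1.0
    format_reward: float = 0.1
    incorrect_reward: float = 0.0


def reward(cfg: UFTSetup, answer_correct: bool, format_ok: bool) -> float:
    """Assumed reward rule: the three table values as exclusive outcomes."""
    if answer_correct:
        return cfg.accuracy_reward
    if format_ok:
        return cfg.format_reward
    return cfg.incorrect_reward


cfg = UFTSetup()
print(cfg.learning_rate, reward(cfg, answer_correct=False, format_ok=True))
```

Keeping the values in a frozen dataclass makes accidental changes during a replication attempt raise an error. The software versions and GPU model that the table marks as unreported would still need to be recorded separately.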