Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection

Authors: Yang Zhao, Kai Xiong, Xiao Ding, Li Du, Yangou Ouyang, Zhouhao Sun, Jiannan Guan, Wenbin Zhang, Bin Liu, Dong Hu, Bing Qin, Ting Liu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimentation across diverse LLMs and mathematical benchmarks demonstrates that training with a mere 10% of the data, carefully selected by UFO-RL, yields performance comparable to or even surpassing that of full-data training. Furthermore, this targeted data selection results in up to a 16× reduction in overall training time, concurrently enhancing training stability and improving generalization capabilities.
Researcher Affiliation | Collaboration | Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China; Beijing Academy of Artificial Intelligence, Beijing, China; Du Xiaoman Technology (Beijing) Co., Ltd.
Pseudocode | No | The paper describes methods and processes in paragraph form, such as 'Confidence Estimation via Average Log-Softmax' and 'Confidence-Based Data Filtering', but does not present them as structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code for this work is publicly available at: https://github.com/zy125413/UFO_RL.
Open Datasets | Yes | Data selection was conducted on the commonly used RL datasets GSM8K [1] and DAPO-MATH-17K [23]. ... For evaluation, we assessed models on the GSM8K test set, which contains elementary math problems. To gauge performance on more complex mathematical and quantitative reasoning tasks, we included Math500 [11]. Furthermore, MMLU [5] was used as a general-domain text understanding benchmark...
Dataset Splits | Yes | Next, to analyze how this proxy relates to learning, we sorted all training examples based on their computed p_i values and partitioned them into K = 10 equally-sized bins G_0, ..., G_9. Each bin contains one-tenth of the total number of examples in the dataset. ... Ultimately, the top 10% of samples, as ranked by this fuzziness score, are selected from the candidate dataset to constitute the training data for RL.
Hardware Specification | Yes | During the training phase, all experiments were conducted using the open-r1 framework, executed on a computing cluster equipped with 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions 'Deep Speed Zero-2 optimization technology', 'vLLM', and the 'open-r1 framework' but does not specify version numbers for these software components.
Experiment Setup | Yes | Key parameters included a learning rate of 1e-6. ... During this process, the temperature was set to 1. ... In the model evaluation phase, all our experiments employed a zero-shot evaluation method. To ensure fairness, the temperature was set to 0.
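
The Dataset Splits entry describes two operations: sorting examples by a per-example value and cutting them into K = 10 equal bins, and keeping the top 10% of examples ranked by a fuzziness score. The following is a minimal sketch of both steps, not the authors' implementation. The function names `partition_into_bins` and `select_top_fraction` are hypothetical, and the scores are taken as given inputs because this page does not define how the confidence values or the fuzziness score are computed.

```python
from typing import List, Sequence


def partition_into_bins(values: Sequence[float], k: int = 10) -> List[List[int]]:
    """Sort example indices by value (ascending) and split them into k bins.

    Hypothetical helper: bin sizes differ by at most one example when
    len(values) is not divisible by k.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    return [order[(j * n) // k:((j + 1) * n) // k] for j in range(k)]


def select_top_fraction(scores: Sequence[float], fraction: float = 0.1) -> List[int]:
    """Return indices of the highest-scoring examples, keeping `fraction` of them.

    Hypothetical helper: `scores` stands in for the fuzziness score, whose
    definition is assumed to be computed elsewhere.
    """
    keep = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]
```

Calling `partition_into_bins(p_values)` yields ten index lists ordered from the lowest to the highest value, and `select_top_fraction(fuzziness_scores)` yields the indices of the retained 10% subset. Both return indices rather than examples so the same selection can be applied to any parallel list of prompts or labels.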