Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs

Authors: Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.
Researcher Affiliation | Academia | Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Pseudocode | Yes | Based on the above analyses, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs (with pseudo-code in Appendix C).
Open Source Code | Yes | We have provided the code, data, and instructions in the supplemental material.
Open Datasets | Yes | We evaluate on multiple reasoning benchmarks, including AIME24-25, AMC, MATH500 [18], Minerva MATH [27], and Omni-MATH [13] (see Appendices E and G for more benchmarks, e.g., code generation). ... HumanEval consists of 164 Python programming tasks...
Dataset Splits | Yes | For DeepScaleR-1.5B-Preview ... fine-tune it on the 919 AIME questions (from 1989 to 2023); for DeepSeek-R1-Distill-Qwen-1.5B, we fine-tune on a random sample of 4,000 question-answer pairs from NuminaMath [28].
Hardware Specification | Yes | All experiments are run on the A100 GPU clusters.
Software Dependencies | No | Additionally, we set the use_vllm flag to True to enable vLLM acceleration, with a GPU memory utilization of 0.8. We also utilize mixed precision training with BF16 enabled.
Experiment Setup | Yes | For optimization, we set the learning rate to 1e-6, weight decay to 0.01, and batch size to 256. ... The hyperparameters α (Eq. 5) and β (Eq. 3) are set to 0.8 and 0.6, respectively.
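
For readers who want the reported settings in one place, the sketch below collects the values quoted in the Software Dependencies and Experiment Setup rows into a plain Python dataclass. It is only an illustration: the class name ReportedSetup, every field name other than use_vllm, and the check function are hypothetical and do not come from the paper or its code, and the learning rate is read here as 1e-6 on the assumption that the extracted text dropped the minus sign.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Settings quoted in the report; field names are hypothetical."""
    learning_rate: float = 1e-6  # assumed reading of the reported value
    weight_decay: float = 0.01
    batch_size: int = 256
    alpha: float = 0.8           # hyperparameter alpha (Eq. 5)
    beta: float = 0.6            # hyperparameter beta (Eq. 3)
    use_vllm: bool = True        # flag name as quoted in the report
    gpu_memory_utilization: float = 0.8
    bf16: bool = True            # mixed precision with BF16


def check(cfg: ReportedSetup) -> None:
    """Basic sanity checks; raises ValueError on an implausible value."""
    if cfg.learning_rate <= 0:
        raise ValueError("learning_rate must be positive")
    if cfg.batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0.0 < cfg.gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")


cfg = ReportedSetup()
check(cfg)
settings = asdict(cfg)  # plain dict, e.g. for logging alongside a run
```

Freezing the dataclass keeps the recorded values from being changed by accident, and asdict gives a dictionary that can be written to a run log. How these values map onto the authors' actual training script is not stated in the report.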