Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs

Authors: Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.
Researcher Affiliation	Academia	Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Pseudocode	Yes	Based on the above analyses, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs (with pseudo-code in Appendix C).
Open Source Code	Yes	We have provided the code, data, and instructions in the supplemental material.
Open Datasets	Yes	We evaluate on multiple reasoning benchmarks, including AIME24-25, AMC, MATH500 [18], Minerva MATH [27], and Omni-MATH [13] (see Appendices E and G for more benchmarks, e.g., code generation). ... HumanEval consists of 164 Python programming tasks...
Dataset Splits	Yes	For DeepScaleR-1.5B-Preview ... fine-tune it on the 919 AIME questions (from 1989 to 2023); for DeepSeek-R1-Distill-Qwen-1.5B, we fine-tune on a random sample of 4,000 question-answer pairs from NuminaMath [28].
Hardware Specification	Yes	All experiments are run on the A100 GPU clusters.
Software Dependencies	No	Additionally, we set the use_vllm flag to True to enable vLLM acceleration, with a GPU memory utilization of 0.8. We also utilize mixed precision training with BF16 enabled.
Experiment Setup	Yes	For optimization, we set the learning rate to 1e-6, weight decay to 0.01, and batch size to 256. ... The hyperparameters α (Eq. 5) and β (Eq. 3) are set to 0.8 and 0.6, respectively.
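
For readers who want to reuse the reported settings, the values quoted in the Software Dependencies and Experiment Setup rows can be gathered into one record. This is a minimal sketch, not the authors' configuration file: the class name `ReportedSetup` and all field names except `use_vllm` are hypothetical, the learning rate is taken as 1e-6, and settings the excerpts elide (the "..." above) are left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the table above (field names are illustrative)."""

    learning_rate: float = 1e-6           # learning rate, taken as 1e-6
    weight_decay: float = 0.01            # weight decay
    batch_size: int = 256                 # batch size
    alpha: float = 0.8                    # hyperparameter alpha (Eq. 5 of the paper)
    beta: float = 0.6                     # hyperparameter beta (Eq. 3 of the paper)
    use_vllm: bool = True                 # flag name as quoted in the paper excerpt
    gpu_memory_utilization: float = 0.8   # GPU memory share given to vLLM
    bf16: bool = True                     # BF16 mixed-precision training


setup = ReportedSetup()

# Print one "name = value" line per reported setting.
for name, value in asdict(setup).items():
    print(f"{name} = {value}")
```

Freezing the dataclass keeps the recorded values from being changed by accident; how these fields map onto a particular training framework's arguments is not stated in the excerpts and would need to be checked against the supplemental code.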