Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Table 2: Performance comparison of unsupervised finetuning (EM-FT) and various rewarding methods in EM-RL with supervised finetuning and RL. Italics and bold indicate performance improvement over GRPO and SC-RL (self-consistency RL), respectively. A dash ("-") denotes that self-consistency is inapplicable. FLOPs are reported in units of 10^17 (§D.4).
Researcher Affiliation Academia Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng University of Illinois Urbana-Champaign EMAIL
Pseudocode Yes Algorithm 1: Inference-time Adaptive Temperature
  Input: logits z_t; initial maximum temperature τ_max_init; initial minimum temperature τ_min_init > 0; maximum iterations M; tolerance ϵ_tol > 0; target entropy reduction ratio α ∈ (0, 1); target entropy threshold δ > 0.
  Output: scaled logits z'_t.
  function COMPUTEADAPTIVETEMPERATURE(z_t, τ_max_init, τ_min_init, M, ϵ_tol, α, δ)
    P_initial ← Softmax(z_t)
    H_initial ← Entropy(P_initial)
    H_target ← max(δ, α · H_initial)
    τ_final ← 1.0
    if H_initial > δ then
      τ_low ← τ_min_init
      τ_high ← τ_max_init
      for iteration ← 1 to M do
        τ_mid ← (τ_high + τ_low) / 2
        P_current ← Softmax(z_t / τ_mid)
        H_current ← Entropy(P_current)
        if |H_current − H_target| < ϵ_tol then
          τ_final ← τ_mid
          break
        else if H_current < H_target then
          τ_low ← τ_mid
        else
          τ_high ← τ_mid
        end if
        τ_final ← τ_mid
      end for
    end if
    z'_t ← z_t / τ_final
    return z'_t
  end function
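Algorithm 1 can be sketched in Python roughly as follows. This is a minimal NumPy illustration of the bisection on temperature, not the authors' implementation. The default hyperparameter values, the helper names `softmax` and `entropy`, and the extra returned temperature are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def compute_adaptive_temperature(z_t, tau_max_init=1.0, tau_min_init=0.01,
                                 max_iters=20, eps_tol=1e-3,
                                 alpha=0.5, delta=0.1):
    """Bisect on the temperature until the entropy is close to the target.

    All default values are hypothetical choices for illustration.
    Returns the scaled logits and, for inspection only, the temperature used.
    """
    z_t = np.asarray(z_t, dtype=np.float64)
    h_initial = entropy(softmax(z_t))
    h_target = max(delta, alpha * h_initial)
    tau_final = 1.0  # leave the logits unchanged if entropy is already low
    if h_initial > delta:
        tau_low, tau_high = tau_min_init, tau_max_init
        for _ in range(max_iters):
            tau_mid = (tau_high + tau_low) / 2
            h_current = entropy(softmax(z_t / tau_mid))
            tau_final = tau_mid
            if abs(h_current - h_target) < eps_tol:
                break
            if h_current < h_target:
                tau_low = tau_mid   # too sharp: raise the temperature
            else:
                tau_high = tau_mid  # too flat: lower the temperature
    return z_t / tau_final, tau_final


if __name__ == "__main__":
    scaled, tau = compute_adaptive_temperature([2.0, 1.0, 0.5, 0.0, -1.0])
    print("temperature:", tau)
    print("entropy after scaling:", entropy(softmax(scaled)))
```

The bisection relies on the entropy of Softmax(z_t / τ) being non-decreasing in τ, so an entropy below the target moves the lower bound up and an entropy above the target moves the upper bound down.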
Open Source Code Yes Code: https://github.com/shivamag125/EM_PT
Open Datasets Yes Evaluation: For our training experiments, we evaluate on math and coding tasks using [13] Math500 [29], AMC [43], AIME 2024 [43], Minerva math (Minerva) [42], OlympiadBench (Olymp.) [28], LeetCode (LeetC) [26], and LiveCodeBench-v2 (LiveC) [34]. For inference time scaling, we additionally evaluate on Scicode [78] and UGPhysics [88]. ... Training data, models, and methods: We construct the training data for math and coding by randomly sampling 35K prompts from Numina math [43] and 25K prompts from Eurus-2 coding split [13], respectively.
Dataset Splits Yes We use the validation set from AI-MO’s AMC dataset [43], which contains 83 problems extracted from AMC 12 (2022 and 2023). ... We use the validation set from AI-MO’s AIME dataset [43], which includes 90 problems drawn from AIME 22, AIME 23, and AIME 24. ... MATH 500 contains 500 prompts sampled from the 5,000 test problems in the original MATH dataset [29].
Hardware Specification Yes All experiments are done using 4x GH200 Nvidia GPUs.
Software Dependencies No We use Verl [69] to train.
Experiment Setup Yes We set the number of rollouts to N = 4, batch size to 512, learning rate to 1e-6. We use a KL regularizer with small coefficient β = 0.001 which does not affect entropy minimization.