Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Table 2: Performance comparison of unsupervised finetuning (EM-FT) and various rewarding methods in EM-RL with supervised finetuning and RL. Italics and bold indicate performance improvement over GRPO and SC-RL (self-consistency RL), respectively. Dash line ("-") denotes that self-consistency is inapplicable. FLOPs are reported as 10^17 (§D.4). |
| Researcher Affiliation | Academia | Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng University of Illinois Urbana-Champaign EMAIL |
| Pseudocode | Yes | Algorithm 1 Inference-time Adaptive Temperature. Input: Logits z_t; Initial maximum temperature τ_max_init; Initial minimum temperature τ_min_init > 0; Maximum iterations M; Tolerance ϵ_tol > 0; Target entropy reduction ratio α ∈ (0, 1); Target entropy threshold δ > 0. Output: Scaled logits z′_t. 1: function COMPUTEADAPTIVETEMPERATURE(z_t, τ_max_init, τ_min_init, M, ϵ_tol, α, δ) 2: P_initial ← Softmax(z_t) 3: H_initial ← Entropy(P_initial) 4: H_target ← max(δ, α · H_initial) 5: τ_final ← 1.0 6: if H_initial > δ then 7: τ_low ← τ_min_init 8: τ_high ← τ_max_init 9: for iteration ← 1 to M do 10: τ_mid ← (τ_high + τ_low)/2 11: P_current ← Softmax(z_t/τ_mid) 12: H_current ← Entropy(P_current) 13: if \|H_current − H_target\| < ϵ_tol then 14: τ_final ← τ_mid 15: break 16: else if H_current < H_target then 17: τ_low ← τ_mid 18: else 19: τ_high ← τ_mid 20: end if 21: τ_final ← τ_mid 22: end for 23: end if 24: z′_t ← z_t/τ_final 25: return z′_t 26: end function |
| Open Source Code | Yes | Code: https://github.com/shivamag125/EM_PT |
| Open Datasets | Yes | Evaluation: For our training experiments, we evaluate on math and coding tasks using [13] Math500 [29], AMC [43], AIME 2024 [43], Minerva math (Minerva) [42], OlympiadBench (Olymp.) [28], LeetCode (LeetC) [26], and LiveCodeBench-v2 (LiveC) [34]. For inference time scaling, we additionally evaluate on Scicode [78] and UGPhysics [88]. ... Training data, models, and methods: We construct the training data for math and coding by randomly sampling 35K prompts from Numina math [43] and 25K prompts from Eurus-2 coding split [13], respectively. |
| Dataset Splits | Yes | We use the validation set from AI-MO’s AMC dataset [43], which contains 83 problems extracted from AMC 12 (2022 and 2023). ... We use the validation set from AI-MO’s AIME dataset [43], which includes 90 problems drawn from AIME 22, AIME 23, and AIME 24. ... MATH 500 contains 500 prompts sampled from the 5,000 test problems in the original MATH dataset [29]. |
| Hardware Specification | Yes | All experiments are done using 4x GH200 Nvidia GPUs. |
| Software Dependencies | No | We use Verl [69] to train. |
| Experiment Setup | Yes | We set the number of rollouts to N = 4, batch size to 512, learning rate to 1e-6. We use a KL regularizer with small coefficient β = 0.001 which does not affect entropy minimization. |
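
The bisection search quoted in the Pseudocode row can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `softmax` and `entropy`, the returned `(scaled_logits, tau_final)` pair, and every default argument value are hypothetical choices made for this example, since the table does not report the values used in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def compute_adaptive_temperature(
    logits,
    tau_max_init=1.0,   # assumed default, not taken from the paper
    tau_min_init=0.01,  # assumed default, not taken from the paper
    max_iter=50,        # assumed default for M
    tol=1e-4,           # assumed default for the tolerance
    alpha=0.5,          # assumed target entropy reduction ratio
    delta=0.1,          # assumed target entropy threshold
):
    """Bisect on the temperature until the entropy is close to the target."""
    h_initial = entropy(softmax(logits))
    h_target = max(delta, alpha * h_initial)
    tau_final = 1.0
    # Only sharpen distributions whose entropy exceeds the threshold.
    if h_initial > delta:
        tau_low, tau_high = tau_min_init, tau_max_init
        for _ in range(max_iter):
            tau_mid = (tau_high + tau_low) / 2
            h_current = entropy(softmax(logits, tau_mid))
            tau_final = tau_mid
            if abs(h_current - h_target) < tol:
                break
            if h_current < h_target:
                # Entropy too low: the temperature must go up.
                tau_low = tau_mid
            else:
                # Entropy too high: the temperature must go down.
                tau_high = tau_mid
    return [z / tau_final for z in logits], tau_final
```

Calling `compute_adaptive_temperature([2.0, 1.0, 0.5, 0.1])` with these assumed defaults returns logits rescaled by a temperature below 1, so the resulting softmax has roughly half the original entropy. A distribution whose entropy is already below `delta` is returned unchanged with a temperature of 1.0. The search relies on softmax entropy being non-decreasing in the temperature, which is what makes bisection applicable here.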