Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

On the Entropy Calibration of Language Models

Authors: Steven Cao, Gregory Valiant, Percy Liang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters.
Researcher Affiliation | Academia | Steven Cao Stanford University EMAIL Gregory Valiant Stanford University EMAIL Percy Liang Stanford University EMAIL
Pseudocode | Yes | Algorithm 1: Future entropy scaling. Inputs: model p̂, length T, vocab V, future entropy fitting algorithm A, future entropy dataset size n, sample size m, prompt distribution q, true conditional distribution p, optimization tolerance ε
Open Source Code | Yes | https://github.com/stevenxcao/entropy-calibration
Open Datasets | Yes | We study four model families ... applied to the following three datasets: (a) WikiText-103 (Merity et al., 2017): given 128 tokens of context from a Wikipedia passage, the model is tasked with completing the passage. (b) WritingPrompts (Fan et al., 2018): given a prompt from r/writingprompts along with 128 tokens of context from a human-written story, the model is tasked with completing the story. (c) CodeContests (Li et al., 2022): given a coding problem from one of five websites and 128 tokens of context from a human-written solution, the model is tasked with completing the solution.
Dataset Splits | No | In each setting, we use 5000 examples and limit samples to 1024 tokens; see Appendix C for more experimental details.
Hardware Specification | Yes | All experiments are run on 1-4 NVIDIA-A100-SXM4-80GB GPUs, or 1-4 NVIDIA RTX 6000 Ada Generation 49.1GB GPUs.
Software Dependencies | No | For generation we use vLLM (Kwon et al., 2023) with the xFormers attention kernel (Lefaudeux et al., 2022) and no quantization, and we use Hugging Face (Wolf et al., 2020) with 4-bit quantization (Dettmers et al., 2022) to compute logprobs. All experiments are run using PyTorch (Paszke et al., 2019), and all plots are produced using Matplotlib (Hunter, 2007).
Experiment Setup | Yes | In each setting, we use 5000 examples and limit samples to 1024 tokens. ... we compare the model with temperature 1.0 to that with temperature 0.95, 0.9, 0.85, or 0.8.
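
The temperature comparison quoted in the Experiment Setup row can be illustrated with a small sketch. The Python below is not taken from the paper or its repository; it assumes a hypothetical four-token logit vector and only shows how dividing logits by a temperature below 1.0 lowers the entropy of a single next-token distribution. The function names and the logit values are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for a 4-token vocabulary (an assumption for illustration).
logits = [2.0, 1.0, 0.5, -1.0]

# The temperature values quoted in the Experiment Setup row.
temperatures = [1.0, 0.95, 0.9, 0.85, 0.8]

entropies = {
    t: entropy(softmax_with_temperature(logits, t)) for t in temperatures
}

for t in temperatures:
    print(f"temperature={t:.2f}  entropy={entropies[t]:.4f} nats")
```

For any non-uniform logit vector, the printed entropy shrinks as the temperature drops from 1.0 to 0.8. This per-step entropy is a simplification and is not the paper's calibration measurement over full generations.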