Calibration, Entropy Rates, and Memory in Language Models
Authors: Mark Braverman, Xinyi Chen, Sham Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that state-of-the-art language models, including LSTMs and Transformers, are miscalibrated: the entropy rates of their generations drift dramatically upward over time. We then provide provable methods to mitigate this phenomenon. Furthermore, we show how this calibration-based approach can also be used to measure the amount of memory that language models use for prediction. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, Princeton University, Princeton, New Jersey, USA; (2) Google AI Princeton, Princeton, New Jersey, USA; (3) University of Washington, Allen School of Computer Science and Engineering and Department of Statistics, Seattle, Washington, USA; (4) Microsoft Research, New York, New York, USA. |
| Pseudocode | Yes | Algorithm 1 (Inefficient) Entropy Rate Calibration; Algorithm 2 Local Entropy Rate Calibration |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | Model / corpus / test ppl. / entropy rate: 1) AWD-LSTM, PTB, 58.3, 93.1; 2) CNN-LSTM, GBW, 29.8, 49.4; 3) Transformer, GBW, 28.1, 34.7; 4) GPT-2, WebText, 23.7, 61.2. ... Left: LSTM trained on Penn Treebank. Right: GPT-2 Transformer. ... Table 2: Sample generations from a calibrated, state-of-the-art Transformer model trained on the GBW dataset |
| Dataset Splits | No | The paper mentions 'holdout validation set' in Table 2, but it does not specify exact split percentages, absolute sample counts, or reference predefined splits with citations for how the data was partitioned for training, validation, and testing. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., specific GPU/CPU models, memory amounts, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific names and version numbers for ancillary software components or libraries used in the experiments. |
| Experiment Setup | No | The paper states that 'Model and implementation details are in the supplementary material' but does not provide specific experimental setup details, such as concrete hyperparameter values or training configurations, within the main text. |
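Since the paper gives Algorithms 1 and 2 only as pseudocode and no source code is released, the snippet below is a minimal, self-contained sketch of the underlying idea: track the per-step conditional entropy of a model's generations (whose average drifts upward for miscalibrated models) and fit a single temperature-like parameter so the generation entropy rate matches a target, such as the entropy rate measured on held-out text. The toy stand-in model `next_token_probs`, the vocabulary size, and the bisection search are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (assumption for this sketch)

def next_token_probs(prefix):
    # Toy stand-in LM: the distribution flattens as the prefix grows,
    # mimicking the upward entropy-rate drift reported for real LMs.
    scale = 1.0 / (1.0 + 0.02 * len(prefix))
    logits = scale * rng.normal(size=VOCAB)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def entropy(p):
    # Shannon entropy (in nats) of a next-token distribution.
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def generate(temperature=1.0, steps=200):
    # Sample a sequence and record the per-step conditional entropies.
    tokens, entropies = [], []
    for _ in range(steps):
        p = next_token_probs(tokens) ** (1.0 / temperature)
        p /= p.sum()
        entropies.append(entropy(p))
        tokens.append(int(rng.choice(VOCAB, p=p)))
    return tokens, entropies

def calibrate_temperature(target_entropy_rate, lo=0.25, hi=4.0, iters=25):
    # One-parameter calibration by bisection: higher temperature raises the
    # generation entropy rate, so shrink the bracket toward the target.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, ents = generate(temperature=mid)
        if np.mean(ents) > target_entropy_rate:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    _, ents = generate(temperature=1.0)
    print("entropy rate, first 50 vs. last 50 steps:",
          round(float(np.mean(ents[:50])), 3), round(float(np.mean(ents[-50:])), 3))
    T = calibrate_temperature(target_entropy_rate=float(np.mean(ents[:50])))
    print("fitted temperature:", round(T, 3))
```

This only illustrates the measure-and-adjust loop; the paper's algorithms operate on trained LSTM/Transformer models and come with provable guarantees that this toy example does not attempt to reproduce.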