Calibration, Entropy Rates, and Memory in Language Models
Authors: Mark Braverman, Xinyi Chen, Sham Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that state-of-the-art language models, including LSTMs and Transformers, are miscalibrated: the entropy rates of their generations drift dramatically upward over time. We then provide provable methods to mitigate this phenomenon. Furthermore, we show how this calibration-based approach can also be used to measure the amount of memory that language models use for prediction. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, Princeton University, Princeton, New Jersey, USA; (2) Google AI Princeton, Princeton, New Jersey, USA; (3) University of Washington, Allen School of Computer Science and Engineering and Department of Statistics, Seattle, Washington, USA; (4) Microsoft Research, New York, New York, USA. |
| Pseudocode | Yes | Algorithm 1 (Inefficient) Entropy Rate Calibration; Algorithm 2 Local Entropy Rate Calibration |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | Model / corpus / test ppl. / entropy rate: 1) AWD-LSTM, PTB, 58.3, 93.1; 2) CNN-LSTM, GBW, 29.8, 49.4; 3) Transformer, GBW, 28.1, 34.7; 4) GPT-2, WebText, 23.7, 61.2. ... Left: LSTM trained on Penn Treebank. Right: GPT-2 Transformer. ... Table 2: Sample generations from a calibrated, state-of-the-art Transformer model trained on the GBW dataset |
| Dataset Splits | No | The paper mentions 'holdout validation set' in Table 2, but it does not specify exact split percentages, absolute sample counts, or reference predefined splits with citations for how the data was partitioned for training, validation, and testing. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., specific GPU/CPU models, memory amounts, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific names and version numbers for ancillary software components or libraries used in the experiments. |
| Experiment Setup | No | The paper states that 'Model and implementation details are in the supplementary material' but does not provide specific experimental setup details, such as concrete hyperparameter values or training configurations, within the main text. |
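Since the paper gives Algorithms 1 and 2 only as pseudocode and no source code is released, the snippet below is a minimal, self-contained sketch of the underlying idea: track the per-step conditional entropy of a model's generations (whose average drifts upward for miscalibrated models) and fit a single temperature-like parameter so the generation entropy rate matches a target, such as the entropy rate measured on held-out text. The toy stand-in model `next_token_probs`, the vocabulary size, and the bisection search are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (assumption for this sketch)

def next_token_probs(prefix):
    # Toy stand-in LM: the distribution flattens as the prefix grows,
    # mimicking the upward entropy-rate drift reported for real LMs.
    scale = 1.0 / (1.0 + 0.02 * len(prefix))
    logits = scale * rng.normal(size=VOCAB)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def entropy(p):
    # Shannon entropy (in nats) of a next-token distribution.
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def generate(temperature=1.0, steps=200):
    # Sample a sequence and record the per-step conditional entropies.
    tokens, entropies = [], []
    for _ in range(steps):
        p = next_token_probs(tokens) ** (1.0 / temperature)
        p /= p.sum()
        entropies.append(entropy(p))
        tokens.append(int(rng.choice(VOCAB, p=p)))
    return tokens, entropies

def calibrate_temperature(target_entropy_rate, lo=0.25, hi=4.0, iters=25):
    # One-parameter calibration by bisection: higher temperature raises the
    # generation entropy rate, so shrink the bracket toward the target.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, ents = generate(temperature=mid)
        if np.mean(ents) > target_entropy_rate:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    _, ents = generate(temperature=1.0)
    print("entropy rate, first 50 vs. last 50 steps:",
          round(float(np.mean(ents[:50])), 3), round(float(np.mean(ents[-50:])), 3))
    T = calibrate_temperature(target_entropy_rate=float(np.mean(ents[:50])))
    print("fitted temperature:", round(T, 3))
```

This only illustrates the measure-and-adjust loop; the paper's algorithms operate on trained LSTM/Transformer models and come with provable guarantees that this toy example does not attempt to reproduce.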