Evaluating Distributional Distortion in Neural Language Modeling
Authors: Benjamin LeBrun, Alessandro Sordoni, Timothy J. O'Donnell
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that LSTM and Transformer language models (i) systematically underestimate the probability of sequences drawn from the target language, and (ii) do so more severely for less probable sequences. (A scoring sketch of this per-sequence comparison follows the table.) |
| Researcher Affiliation | Collaboration | Benjamin LeBrun¹,², Alessandro Sordoni³,* & Timothy J. O'Donnell¹,²,⁴,*: ¹McGill University, ²Mila Quebec Artificial Intelligence Institute, ³Microsoft Research, ⁴Canada CIFAR AI Chair, Mila |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | All Transformer implementations were obtained from Huggingface, and training was done on two or four RTX-8000 GPUs (depending on model size) with mixed floating point precision. |
| Open Datasets | Yes | To define a generative model L, we train a randomly-initialized GPT2-medium on 1.5M sentences sampled from the OpenWebText corpus (Gokaslan & Cohen, 2019). |
| Dataset Splits | Yes | Models with the lowest cross-entropy loss on a withheld validation set are used in experiments unless otherwise mentioned. ... We begin by exploring model estimation error on a fixed training set D_train of 1M sequences sampled from p_L. ... we sample a test set D_test of 500,000 sequences from p_L. (A sampling sketch follows the table.) |
| Hardware Specification | Yes | All Transformer implementations were obtained from Huggingface, and training was done on two or four RTX-8000 GPUs (depending on model size) with mixed floating point precision. |
| Software Dependencies | No | We use the Huggingface (Wolf et al., 2020) implementations of GPT2-small, GPT2-medium and GPT2-large (Radford et al., 2019) as representative Transformer LMs. |
| Experiment Setup | Yes | For all model sizes, we use a batch size of 128 sequences. ... We use Adam optimization with ϵ = 1e-8 and learning rates α = 5e-5, α = 4e-5 and α = 3e-5 for GPT2-small, -medium and -large respectively. (A configuration sketch follows the table.) |
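
The paper's central measurement compares each sequence's probability under the generative model p_L with its probability under the trained model. The sketch below is not the authors' code: it shows one way such per-sequence scoring could be done with the Huggingface API, with the checkpoint paths and the placeholder test sentence as illustrative assumptions.

```python
# Minimal sketch: score sequences under the "true" model p_L and a trained model q,
# then inspect the log-probability gap (negative gap => underestimation by q).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

p_L = GPT2LMHeadModel.from_pretrained("path/to/generative-model").to(device).eval()  # assumed checkpoint
q = GPT2LMHeadModel.from_pretrained("path/to/trained-model").to(device).eval()       # assumed checkpoint

@torch.no_grad()
def sequence_log_prob(model, text):
    """Total log-probability of the sequence (conditioned on its first token)."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # HF returns the mean token-level cross-entropy over the (len - 1) predicted tokens.
    loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

for text in ["a sequence sampled from p_L"]:          # placeholder for sequences in D_test
    true_lp = sequence_log_prob(p_L, text)
    model_lp = sequence_log_prob(q, text)
    print(true_lp, model_lp, model_lp - true_lp)
```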
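
The paper's D_train and D_test are drawn directly from p_L. A minimal sketch of such ancestral sampling with Huggingface's `generate`, assuming an illustrative checkpoint path, batch size, and maximum length:

```python
# Minimal sketch: draw sequences from p_L by unbiased ancestral sampling
# (no top-k / top-p truncation), as a stand-in for constructing D_train / D_test.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
p_L = GPT2LMHeadModel.from_pretrained("path/to/generative-model").to(device).eval()  # assumed checkpoint

@torch.no_grad()
def sample_sequences(n, max_length=128, batch_size=64):
    """Draw n sequences from p_L; max_length and batch_size are illustrative."""
    bos = torch.full((batch_size, 1), tokenizer.bos_token_id, device=device)
    samples = []
    while len(samples) < n:
        out = p_L.generate(bos, do_sample=True, top_k=0, top_p=1.0,
                           max_length=max_length, pad_token_id=tokenizer.eos_token_id)
        samples.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return samples[:n]

d_train = sample_sequences(1_000_000)   # the paper reports 1M training sequences
```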
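
The reported optimization settings (128-sequence batches, Adam with ϵ = 1e-8, learning rates of 5e-5 / 4e-5 / 3e-5 for GPT2-small / -medium / -large, mixed precision, and selection by validation loss) map naturally onto the Huggingface Trainer. The sketch below is an assumed reconstruction, not the authors' training script; the dataset objects and output path are placeholders.

```python
# Minimal sketch of the reported training configuration via the Huggingface Trainer.
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

LEARNING_RATES = {"gpt2": 5e-5, "gpt2-medium": 4e-5, "gpt2-large": 3e-5}
model_name = "gpt2-medium"

train_dataset = None   # placeholder: tokenized D_train sampled from p_L
valid_dataset = None   # placeholder: held-out validation split

model = GPT2LMHeadModel.from_pretrained(model_name)
args = TrainingArguments(
    output_dir="checkpoints/" + model_name,   # placeholder path
    per_device_train_batch_size=32,           # e.g. 32 per device x 4 GPUs = 128-sequence batches
    learning_rate=LEARNING_RATES[model_name],
    adam_epsilon=1e-8,
    fp16=True,                                # mixed floating-point precision
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,              # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset,
                  eval_dataset=valid_dataset)
trainer.train()
```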