Evaluating Distributional Distortion in Neural Language Modeling

Authors: Benjamin LeBrun, Alessandro Sordoni, Timothy J. O'Donnell

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments reveal that LSTM and Transformer language models (i) systematically underestimate the probability of sequences drawn from the target language, and (ii) do so more severely for less-probable sequences.
Researcher Affiliation | Collaboration | Benjamin LeBrun (1,2), Alessandro Sordoni (3,*) & Timothy J. O'Donnell (1,2,4,*); 1 McGill University, 2 Mila - Quebec Artificial Intelligence Institute, 3 Microsoft Research, 4 Canada CIFAR AI Chair, Mila
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | All Transformer implementations were obtained from Huggingface, and training was done on two or four RTX-8000 GPUs (depending on model size) with mixed floating point precision.
Open Datasets | Yes | To define a generative model L, we train a randomly-initialized GPT2-medium on 1.5M sentences sampled from the OpenWebText corpus (Gokaslan & Cohen, 2019).
Dataset Splits | Yes | Models with the lowest cross-entropy loss on a withheld validation set are used in experiments unless otherwise mentioned. ... We begin by exploring model estimation error on a fixed training set D_train of 1M sequences sampled from p_L. ... we sample a test set D_test of 500,000 sequences from p_L.
Hardware Specification | Yes | All Transformer implementations were obtained from Huggingface, and training was done on two or four RTX-8000 GPUs (depending on model size) with mixed floating point precision.
Software Dependencies | No | We use the Huggingface (Wolf et al., 2020) implementations of GPT2-small, GPT2-medium and GPT2-large (Radford et al., 2019) as representative Transformer LMs.
Experiment Setup | Yes | For all model sizes, we use a batch size of 128 sequences. ... We use Adam optimization with ϵ = 1e-8 and learning rates α = 5e-5, α = 4e-5 and α = 3e-5 for GPT2-small, -medium and -large respectively.
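
The Open Datasets and Dataset Splits rows quote the paper's data-generation setup: a generative model L is defined by training a randomly-initialized GPT2-medium on 1.5M OpenWebText sentences, and a training set of 1M sequences plus a test set of 500,000 sequences are then sampled from p_L. The sketch below illustrates only the skeleton of that setup with the Huggingface API; the training step on OpenWebText is omitted, and the batch size, sequence length, and variable names are illustrative assumptions rather than the authors' code.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Generative model L: a randomly-initialized GPT2-medium (no pretrained weights).
# In the paper this model is then trained on 1.5M OpenWebText sentences;
# that training step is omitted from this sketch.
config = GPT2Config.from_pretrained("gpt2-medium")
p_L = GPT2LMHeadModel(config)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")

# Ancestral sampling from p_L (do_sample=True, no top-k / top-p truncation),
# which is how D_train and D_test sequences would be drawn. The batch size
# and max_length here are toy values, not the paper's.
p_L.eval()
with torch.no_grad():
    bos = torch.full((8, 1), tokenizer.bos_token_id, dtype=torch.long)
    sample_ids = p_L.generate(
        bos,
        do_sample=True,
        top_k=0,
        max_length=64,
        pad_token_id=tokenizer.eos_token_id,
    )
samples = tokenizer.batch_decode(sample_ids, skip_special_tokens=True)
```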
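
The Experiment Setup and Hardware rows quote a batch size of 128 sequences, Adam with ϵ = 1e-8, learning rates of 5e-5, 4e-5, and 3e-5 for GPT2-small, -medium, and -large, and mixed-precision training on two or four RTX-8000 GPUs. A rough translation of those hyperparameters into Huggingface TrainingArguments might look as follows; the use of Trainer, the per-device batch split, the evaluation schedule, and the choice of pretrained initialization are assumptions, since the paper does not release its training script.

```python
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

# Learning rates quoted in the paper, keyed by Huggingface model name.
LEARNING_RATES = {"gpt2": 5e-5, "gpt2-medium": 4e-5, "gpt2-large": 3e-5}
model_name = "gpt2-medium"

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=32,   # 32 x 4 GPUs = 128 sequences per step (assumed split)
    learning_rate=LEARNING_RATES[model_name],
    adam_epsilon=1e-8,
    fp16=True,                        # mixed floating point precision
    evaluation_strategy="epoch",      # evaluate on the withheld validation set
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the lowest validation-loss checkpoint
)

# Initialization (pretrained vs. from scratch) is not specified in the
# excerpts quoted above; from_pretrained is used here purely for illustration.
model = GPT2LMHeadModel.from_pretrained(model_name)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```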
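
Finally, the finding quoted in the Research Type row (systematic underestimation of sequence probabilities, worse for less-probable sequences) amounts to comparing the log-probability a trained model assigns to a sequence against that sequence's log-probability under p_L. A minimal scoring sketch, assuming a Huggingface causal LM, is shown below; the helper name and the unit (total nats per sequence) are illustrative choices, not the paper's evaluation code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def sequence_log_prob(model, tokenizer, text):
    """Total log-probability (in nats) that a causal LM assigns to `text`.

    Illustrative helper, not the authors' evaluation code.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, Huggingface returns the mean next-token NLL
        # over the len(ids) - 1 predicted positions.
        out = model(ids, labels=ids)
    n_predicted = ids.size(1) - 1
    return -out.loss.item() * n_predicted

# Underestimation would show up as the trained model's score falling below
# the score under p_L for sequences sampled from p_L.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(sequence_log_prob(model, tokenizer, "A sequence drawn from the target language."))
```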