Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

Authors: Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, Armen Aghajanyan

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process. (See the exact-memorization sketch after the table.)
Researcher Affiliation | Industry | Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, Armen Aghajanyan (Meta AI Research); {ktirumala,amarkos,lsz,armenag}@fb.com
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present any structured code blocks.
Open Source Code | No | Unfortunately, the exact code used to produce results is proprietary.
Open Datasets | Yes | We use two existing datasets across all our experiments: the WIKITEXT-103 benchmark containing around 103 million tokens [62], and the RoBERTa corpus [55] used to train the original RoBERTa model, containing around 39 billion tokens (we refer to this as the ROBERTA dataset). ... For most of our experiments we use the WIKITEXT-103 benchmark, which is publicly available; some of our experiments run on the ROBERTA dataset, which is not publicly available. (See the dataset-loading sketch after the table.)
Dataset Splits | No | We first choose a batch of data not available in the training set, i.e., a batch of data from a validation set. ... We define overfitting as occurring at the first epoch when the perplexity of the language model on a validation set increases. (See the overfitting-epoch sketch after the table.)
Hardware Specification | Yes | For the smaller models (up to 2.7B) we use 32 NVIDIA A100 (40 GB) GPUs, and for the larger models (6.7B and 13B) we use 64 NVIDIA A100 (80 GB) GPUs.
Software Dependencies | No | We train using the fairseq framework [69] with PyTorch [70] as the underlying library. For our larger models, we use the fully sharded data-parallel implementation available in FairScale [9] and use Aim experiment tracking [6]. (See the FSDP wrapping sketch after the table.)
Experiment Setup | Yes | All models are trained with the Adam optimizer [48] using β1 = 0.9, β2 = 0.98, and ϵ = 10^-6. We use the GELU [38] activation function. We apply 10% warmup with a cosine-decay learning-rate schedule. We use mixed-precision training [63] where applicable. We use a batch size of 2048 tokens and gradient-accumulate up to 8192 tokens. (See the optimizer/schedule sketch after the table.)
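
A minimal reading of the "exact memorization" measurement referenced in the Research Type row: a per-token argmax-match rate for a causal language model. The sketch below assumes a Hugging Face-style model that returns `.logits`; the function name `exact_memorization` and this particular formulation are illustrative, not the paper's code.

```python
import torch

@torch.no_grad()
def exact_memorization(model, input_ids, attention_mask=None):
    """Fraction of next-token predictions whose argmax equals the true token.

    One plausible reading of 'exact memorization' for a causal LM; the
    authoritative definition is in the paper itself.
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Shift so that logits at position t predict token t+1.
    logits = outputs.logits[:, :-1, :]
    targets = input_ids[:, 1:]
    predictions = logits.argmax(dim=-1)
    correct = predictions == targets
    if attention_mask is not None:
        correct = correct[attention_mask[:, 1:].bool()]
    return correct.float().mean().item()
```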
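
The Open Datasets row notes that WIKITEXT-103 is publicly available. One common way to fetch it, unrelated to the authors' fairseq preprocessing, is the copy hosted on the Hugging Face Hub; the `wikitext` / `wikitext-103-raw-v1` config used here is an assumption of this sketch, not something the paper specifies.

```python
from datasets import load_dataset

# Public WIKITEXT-103 copy on the Hugging Face Hub; a convenience route,
# not the fairseq pipeline used in the paper.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext)                      # train / validation / test splits
print(wikitext["train"][10]["text"])  # raw text of one training record
```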
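
The Dataset Splits row quotes the paper's operational definition of overfitting: the first epoch at which validation perplexity rises. A minimal sketch of that check over a list of per-epoch validation perplexities (the helper name and example numbers are illustrative):

```python
def first_overfitting_epoch(val_perplexities):
    """Return the first epoch (1-indexed) whose validation perplexity is
    higher than the previous epoch's, or None if perplexity never rises."""
    for i in range(1, len(val_perplexities)):
        if val_perplexities[i] > val_perplexities[i - 1]:
            return i + 1  # epochs counted from 1
    return None

# Example: perplexity falls for three epochs, then rises at epoch 4.
print(first_overfitting_epoch([21.3, 18.9, 17.5, 18.2]))  # -> 4
```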
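
For the Software Dependencies row: the larger models rely on FairScale's fully sharded data parallelism. Below is a minimal sketch of wrapping a placeholder model with FairScale's FSDP, assuming GPUs are available and a NCCL process group has already been initialized; it is not the authors' training script, and the model dimensions are arbitrary.

```python
import torch
import torch.distributed as dist
from fairscale.nn import FullyShardedDataParallel as FSDP

# Assumes the process group was initialized first, e.g. under torchrun:
#   dist.init_process_group(backend="nccl")
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16), num_layers=24
).cuda()

# Shard parameters, gradients, and optimizer state across data-parallel
# workers; mixed_precision mirrors the paper's mixed-precision training.
model = FSDP(model, mixed_precision=True)
```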
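
For the Experiment Setup row: a plain-PyTorch sketch of the stated optimizer and schedule (Adam with β1 = 0.9, β2 = 0.98, ϵ = 10^-6; 10% warmup into cosine decay; gradient accumulation from 2048-token batches up to 8192 tokens). The peak learning rate, total step count, and stand-in model are placeholders, not values from the paper.

```python
import math
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the language model
peak_lr = 1e-4                  # placeholder; the paper sweeps learning rates
total_steps = 100_000           # placeholder
warmup_steps = int(0.10 * total_steps)  # 10% warmup, as in the paper

optimizer = torch.optim.Adam(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.98), eps=1e-6
)

def warmup_cosine(step):
    """Linear warmup for the first 10% of steps, cosine decay afterwards."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Accumulate gradients over 2048-token batches until 8192 tokens are seen.
accumulation_steps = 8192 // 2048
```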