Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

Authors: Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, Armen Aghajanyan

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process. (See the exact-memorization sketch after the table.)
Researcher Affiliation | Industry | Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, Armen Aghajanyan (Meta AI Research); {ktirumala,amarkos,lsz,armenag}@fb.com
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present any structured code blocks.
Open Source Code | No | Unfortunately, the exact code used to produce results is proprietary.
Open Datasets | Yes | We use two existing datasets across all our experiments: the WIKITEXT-103 benchmark containing around 103 million tokens [62], and the RoBERTa corpus [55] used to train the original RoBERTa model, containing around 39 billion tokens (we refer to this as the ROBERTA dataset). ... For most of our experiments we use the WIKITEXT-103 benchmark, which is publicly available; some of our experiments run on the ROBERTA dataset, which is not publicly available. (See the dataset-loading sketch after the table.)
Dataset Splits | No | We first choose a batch of data not available in the training set, i.e., a batch of data from a validation set. ... We define overfitting as occurring at the first epoch when the perplexity of the language model on a validation set increases. (See the overfitting-epoch sketch after the table.)
Hardware Specification | Yes | For the smaller models (up to 2.7B) we use 32 NVIDIA A100 (40 GB) GPUs, and for the larger models (6.7B and 13B) we use 64 NVIDIA A100 (80 GB) GPUs.
Software Dependencies | No | We train using the fairseq framework [69] with PyTorch [70] as the underlying library. For our larger models, we use the fully sharded data-parallel implementation available in FairScale [9] and use Aim experiment tracking [6]. (See the FSDP wrapping sketch after the table.)
Experiment Setup | Yes | All models are trained with the Adam optimizer [48] using β1 = 0.9, β2 = 0.98, and ϵ = 10^-6. We use the GELU [38] activation function. We apply 10% warmup with a cosine-decay learning-rate schedule. We use mixed-precision training [63] where applicable. We use a batch size of 2048 tokens and gradient-accumulate up to 8192 tokens. (See the optimizer/schedule sketch after the table.)
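
A minimal reading of the "exact memorization" measurement referenced in the Research Type row: a per-token argmax-match rate for a causal language model. The sketch below assumes a Hugging Face-style model that returns `.logits`; the function name `exact_memorization` and this particular formulation are illustrative, not the paper's code.

```python
import torch

@torch.no_grad()
def exact_memorization(model, input_ids, attention_mask=None):
    """Fraction of next-token predictions whose argmax equals the true token.

    One plausible reading of 'exact memorization' for a causal LM; the
    authoritative definition is in the paper itself.
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Shift so that logits at position t predict token t+1.
    logits = outputs.logits[:, :-1, :]
    targets = input_ids[:, 1:]
    predictions = logits.argmax(dim=-1)
    correct = predictions == targets
    if attention_mask is not None:
        correct = correct[attention_mask[:, 1:].bool()]
    return correct.float().mean().item()
```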
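
The Open Datasets row notes that WIKITEXT-103 is publicly available. One common way to fetch it, unrelated to the authors' fairseq preprocessing, is the copy hosted on the Hugging Face Hub; the `wikitext` / `wikitext-103-raw-v1` config used here is an assumption of this sketch, not something the paper specifies.

```python
from datasets import load_dataset

# Public WIKITEXT-103 copy on the Hugging Face Hub; a convenience route,
# not the fairseq pipeline used in the paper.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext)                      # train / validation / test splits
print(wikitext["train"][10]["text"])  # raw text of one training record
```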
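
The Dataset Splits row quotes the paper's operational definition of overfitting: the first epoch at which validation perplexity rises. A minimal sketch of that check over a list of per-epoch validation perplexities (the helper name and example numbers are illustrative):

```python
def first_overfitting_epoch(val_perplexities):
    """Return the first epoch (1-indexed) whose validation perplexity is
    higher than the previous epoch's, or None if perplexity never rises."""
    for i in range(1, len(val_perplexities)):
        if val_perplexities[i] > val_perplexities[i - 1]:
            return i + 1  # epochs counted from 1
    return None

# Example: perplexity falls for three epochs, then rises at epoch 4.
print(first_overfitting_epoch([21.3, 18.9, 17.5, 18.2]))  # -> 4
```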
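
For the Software Dependencies row: the larger models rely on FairScale's fully sharded data parallelism. Below is a minimal sketch of wrapping a placeholder model with FairScale's FSDP, assuming GPUs are available and a NCCL process group has already been initialized; it is not the authors' training script, and the model dimensions are arbitrary.

```python
import torch
import torch.distributed as dist
from fairscale.nn import FullyShardedDataParallel as FSDP

# Assumes the process group was initialized first, e.g. under torchrun:
#   dist.init_process_group(backend="nccl")
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16), num_layers=24
).cuda()

# Shard parameters, gradients, and optimizer state across data-parallel
# workers; mixed_precision mirrors the paper's mixed-precision training.
model = FSDP(model, mixed_precision=True)
```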
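
For the Experiment Setup row: a plain-PyTorch sketch of the stated optimizer and schedule (Adam with β1 = 0.9, β2 = 0.98, ϵ = 10^-6; 10% warmup into cosine decay; gradient accumulation from 2048-token batches up to 8192 tokens). The peak learning rate, total step count, and stand-in model are placeholders, not values from the paper.

```python
import math
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the language model
peak_lr = 1e-4                  # placeholder; the paper sweeps learning rates
total_steps = 100_000           # placeholder
warmup_steps = int(0.10 * total_steps)  # 10% warmup, as in the paper

optimizer = torch.optim.Adam(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.98), eps=1e-6
)

def warmup_cosine(step):
    """Linear warmup for the first 10% of steps, cosine decay afterwards."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Accumulate gradients over 2048-token batches until 8192 tokens are seen.
accumulation_steps = 8192 // 2048
```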