Non-Vacuous Generalization Bounds for Large Language Models

Authors: Sanae Lotfi, Marc Anton Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, Andrew Gordon Wilson

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, making bound computation 900 times faster on massive datasets. To achieve the extreme level of compression required for non-vacuous bounds, we devise SubLoRA, a simple low-dimensional nonlinear parameterization that leads to non-vacuous generalization bounds for very large models with up to 849 million parameters. Finally, we use our bounds to understand LLM generalization and find that larger models have better generalization bounds and are more compressible than smaller models. (Sketches of prediction smoothing and the resulting compression bound follow the table.)
Pseudocode | Yes | Algorithm 1: Compute Finite Hypothesis Bound. (A runnable stand-in for such a bound follows the table.)
Open Source Code | Yes | We make our code available here.
Open Datasets | Yes | We pretrain this model on the training split of the OpenWebText dataset (footnote 2) using SubLoRA, LoRA, and subspace training. The link to the OpenWebText dataset is provided in footnote 2: 'http://Skylion007.github.io/OpenWebTextCorpus'
Dataset Splits | No | The paper mentions a 'training split' and a 'validation BPD' (Table 2), but it does not specify the exact percentages or sample counts for the training, validation, or test splits. It implies using a 'training split' of the OpenWebText dataset but does not define how that split was created or its size relative to the whole dataset.
Hardware Specification | No | The paper mentions 'on a single GPU' and 'on 8 GPUs in parallel' but does not specify the brand, model, or any other characteristics of these GPUs or of any other hardware components.
Software Dependencies | No | The paper mentions using the 'PyTorch AdamW optimizer' and that the 'pretraining setup described in nanoGPT' was used as a backbone, but it does not provide specific version numbers for PyTorch or any other software libraries or dependencies.
Experiment Setup | Yes | The training batch is randomly sampled with replacement with a context size of 1024 and a batch size of 8. For optimization, we use a PyTorch AdamW optimizer with weight decay set to 10^-2, epsilon set to 10^-6, and no weight decay on bias parameters (Loshchilov & Hutter, 2017). ... we apply the LoRA modules on the query and value weight matrices in the attention layers. Additionally, we apply LoRA on the linear head of the model. In both cases, we use a LoRA alpha value of 32 and a dropout ratio of 0.1. ... we perform a grid search over subspace dimensions d ∈ {5000, 10000, 25000, 50000, 100000, 200000}, LoRA rank r ∈ {1, 4}, learning rate lr ∈ {2×10^-4, 5×10^-3, 5×10^-5}, and mixing parameter for prediction smoothing α ∈ {0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5}. We also consider two different values for the quantization levels, C ∈ {11, 17}. (A configuration sketch of these settings follows the table.)
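
The 'prediction smoothing' referenced in the abstract makes the otherwise unbounded log-likelihood loss bounded by mixing the model's next-token distribution with a uniform distribution over the vocabulary. Below is a minimal PyTorch sketch of that idea; the function name and tensor shapes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def smoothed_token_nll(logits, targets, alpha, vocab_size):
    """Per-token negative log-likelihood under prediction smoothing.

    The smoothed distribution is p = (1 - alpha) * softmax(logits) + alpha / V,
    so every token probability is at least alpha / V and the loss is bounded
    above by log(V / alpha), which is what the compression bound requires.
    """
    probs = F.softmax(logits, dim=-1)                   # (batch, seq, V)
    probs = (1.0 - alpha) * probs + alpha / vocab_size  # mix with uniform
    picked = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -torch.log(picked)                           # (batch, seq)
```

With a 50,257-token vocabulary and α = 0.1 (one of the grid values above), for example, the per-token loss can never exceed log(50257 / 0.1) ≈ 13.1 nats.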
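
The 'Compute Finite Hypothesis Bound' pseudocode evaluates a compression-style generalization bound once the empirical risk of the compressed, quantized model and its code length are known. The sketch below is a textbook Hoeffding-plus-union-bound version of such a finite-hypothesis bound for a bounded loss, offered as a hedged stand-in rather than a reproduction of the paper's Algorithm 1; every number in the usage example is made up.

```python
import math

def finite_hypothesis_bound(empirical_risk, compressed_bits, n_samples,
                            loss_range, delta=0.05):
    """Occam/compression bound for a finite hypothesis class.

    With a prefix-code prior P(h) = 2^{-compressed_bits}, Hoeffding's
    inequality and a union bound give, with probability at least 1 - delta,
        R(h) <= R_hat(h) + loss_range * sqrt(
                    (compressed_bits * ln 2 + ln(1 / delta)) / (2 * n_samples)).
    """
    complexity = compressed_bits * math.log(2) + math.log(1.0 / delta)
    return empirical_risk + loss_range * math.sqrt(complexity / (2 * n_samples))

# Illustrative usage: a model compressed to 10 kB, evaluated on 2M samples,
# with a smoothed loss bounded by log(V / alpha).
vocab, alpha = 50257, 0.1
print(finite_hypothesis_bound(empirical_risk=3.5,
                              compressed_bits=8 * 10_000,
                              n_samples=2_000_000,
                              loss_range=math.log(vocab / alpha)))
```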
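
Finally, the hyperparameters quoted in the 'Experiment Setup' row can be collected into a short configuration sketch. The dictionary mirrors the reported values and search grid; make_optimizer is an assumed nanoGPT-style helper (parameters with fewer than two dimensions, i.e. biases and norm gains, receive no weight decay) and is not the authors' training script.

```python
import torch

# Values quoted in the table above; lists denote the grid-searched options.
config = {
    "context_size": 1024,
    "batch_size": 8,
    "lora_rank": [1, 4],
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "subspace_dim": [5_000, 10_000, 25_000, 50_000, 100_000, 200_000],
    "learning_rate": [2e-4, 5e-3, 5e-5],
    "prediction_smoothing_alpha": [1e-4, 1e-3, 5e-3, 1e-2, 5e-2, 0.1, 0.25, 0.5],
    "quantization_levels": [11, 17],
}

def make_optimizer(model, lr):
    """AdamW with weight decay 1e-2, eps 1e-6, and no decay on bias/norm params."""
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (decay if p.ndim >= 2 else no_decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": 1e-2},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, eps=1e-6)
```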