Position: Understanding LLMs Requires More Than Statistical Generalization
Authors: Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of finetunability. ... Empirical demonstration. We train a decoder-only Transformer (Vaswani et al., 2017; Radford et al., 2018) on the a^n b^n PCFG and evaluate zero-shot rule extrapolation (Fig. 2), measured as the proportion of times OOD prompts of length 8 are completed consistently with rule R1 (see the evaluation sketch after the table). |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems, Tübingen, Germany 2University of Cambridge, UK 3AI Center, UCL, London, UK 4Department of Computer Science, ETH Zurich 5Max Planck ETH Center for Learning Systems 6ELLIS Institute Tübingen, Germany 7Tübingen AI Center, Germany. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Our code and experimental logs are publicly available at https://github.com/rpatrik96/llm-non-identifiability. |
| Open Datasets | No | We generate data from the a^n b^n PCFG up to length 256. Besides the tokens a (0) and b (1), we use SOS (2), EOS (3), and padding (4) tokens. We define our test prompts as all possible sequences of length 8 (prepended with SOS), which we split into in-distribution and OOD test prompts based on whether they can be completed in the form of a^n b^n. The training set includes all unique sequences up to length 256 (see the data-generation sketch after the table). |
| Dataset Splits | No | We monitor training and validation loss, and the adherence to the grammar's two rules (R1), (R2). ... Table 4. Comparison of the extrapolation performance of MLE, adversarial, and oracle training for OOD prompts. For (approximately) the same validation loss, the extrapolation of (R1) for OOD prompts differs enormously, showing that the loss alone cannot distinguish the extrapolation property. |
| Hardware Specification | No | This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG. |
| Software Dependencies | Yes | We use PyTorch (Paszke et al., 2019), PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019), and Hugging Face Transformers (Wolf et al., 2020). |
| Experiment Setup | Yes | Table 3. Transformer parameters: number of layers 5, dropout probability 0.1, model dimension 10, feedforward dimension 1024, number of attention heads 5, learning rate 2e-3, batch size 128, number of epochs 50,000 (instantiated in the configuration sketch after the table). |
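
The a^n b^n data setup quoted in the Open Datasets row is simple enough to reconstruct. The sketch below is our reading, not the authors' released code (which is in the linked repository): the token IDs (a=0, b=1, SOS=2, EOS=3, PAD=4), the length-8 prompt split, and the completability criterion follow the quoted text, while the function names and enumeration strategy are our own.

```python
# Minimal sketch of the a^n b^n data setup described in the Open Datasets row.
# Token IDs follow the quoted text: a=0, b=1, SOS=2, EOS=3, PAD=4.
from itertools import product

A, B, SOS, EOS, PAD = 0, 1, 2, 3, 4

def training_sequences(max_len=256):
    """All unique a^n b^n strings up to max_len tokens, wrapped in SOS/EOS."""
    return [[SOS] + [A] * n + [B] * n + [EOS] for n in range(1, max_len // 2 + 1)]

def is_ood_prompt(prompt):
    """A prompt is OOD if it cannot be completed to the form a^n b^n:
    an a appears after a b, or the b's already outnumber the a's."""
    seen_b = False
    n_a = n_b = 0
    for tok in prompt:
        if tok == A:
            if seen_b:
                return True          # an a after a b can never be repaired
            n_a += 1
        else:
            seen_b = True
            n_b += 1
    return n_b > n_a                 # excess b's are also unrecoverable

# All 2^8 binary prompts of length 8, prepended with SOS, split by completability.
prompts = [list(p) for p in product([A, B], repeat=8)]
id_prompts  = [[SOS] + p for p in prompts if not is_ood_prompt(p)]
ood_prompts = [[SOS] + p for p in prompts if is_ood_prompt(p)]
```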
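The rule-extrapolation metric from the Research Type row can be sketched similarly. The quoted snippets do not spell out the two rules, so treating R1 as "every a precedes every b" and R2 as "equal counts of a and b" is our assumption, as is the greedy-decoding model interface; the token constants mirror the sketch above.

```python
import torch

A, B, SOS, EOS, PAD = 0, 1, 2, 3, 4  # token IDs as in the sketch above

def satisfies_r1(seq):
    """Assumed reading of (R1): all a's precede all b's."""
    seen_b = False
    for tok in seq:
        if tok == B:
            seen_b = True
        elif tok == A and seen_b:
            return False
    return True

def satisfies_r2(seq):
    """Assumed reading of (R2): equal numbers of a's and b's
    (the paper monitors adherence to both rules)."""
    return sum(t == A for t in seq) == sum(t == B for t in seq)

@torch.no_grad()
def rule_extrapolation(model, ood_prompts, max_new_tokens=256):
    """Fraction of OOD prompts whose greedy completion is consistent with R1."""
    hits = 0
    for prompt in ood_prompts:
        seq = list(prompt)
        for _ in range(max_new_tokens):
            logits = model(torch.tensor([seq]))   # assumed shape: (1, len, vocab)
            nxt = logits[0, -1].argmax().item()   # greedy decoding
            if nxt == EOS:
                break
            seq.append(nxt)
        hits += satisfies_r1(seq[1:])             # drop the SOS token
    return hits / len(ood_prompts)
```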
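Finally, the Table 3 hyperparameters can be instantiated roughly as follows. This is a minimal sketch that assumes PyTorch's built-in encoder layers with a causal mask stand in for the authors' decoder-only model; the real architecture is in the linked repository. Note that model dimension 10 with 5 attention heads implies a head dimension of 2.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 5   # a, b, SOS, EOS, PAD

class DecoderOnlyLM(nn.Module):
    """Causal Transformer LM with the Table 3 hyperparameters."""
    def __init__(self, d_model=10, n_heads=5, n_layers=5,
                 d_ff=1024, dropout=0.1, max_len=258):  # 256 tokens + SOS + EOS
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, x):
        # An upper-triangular -inf mask makes the encoder stack causal,
        # i.e. it behaves as a decoder-only model.
        t = x.size(1)
        mask = torch.triu(torch.full((t, t), float('-inf')), diagonal=1)
        h = self.embed(x) + self.pos(torch.arange(t, device=x.device))
        return self.head(self.blocks(h, mask=mask))

model = DecoderOnlyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)  # learning rate per Table 3
# Training loop (not shown): batch size 128, 50,000 epochs per Table 3.
```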