Position: Understanding LLMs Requires More Than Statistical Generalization
Authors: Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of finetunability. ... Empirical demonstration. We train a decoder-only Transformer (Vaswani et al., 2017; Radford et al., 2018) on the a^n b^n PCFG and evaluate zero-shot rule extrapolation (Fig. 2), measured as the proportion of times OOD prompts of length 8 are completed consistently with rule R1 (see the evaluation sketch after the table). |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems, Tübingen, Germany 2University of Cambridge, UK 3AI Center, UCL, London, UK 4Department of Computer Science, ETH Zurich 5Max Planck ETH Center for Learning Systems 6ELLIS Institute Tübingen, Germany 7Tübingen AI Center, Germany. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Our code and experimental logs are publicly available at https://github.com/rpatrik96/llm-non-identifiability. |
| Open Datasets | No | We generate data from the a^n b^n PCFG up to length 256. Besides the tokens a (0) and b (1), we use SOS (2), EOS (3), and padding (4) tokens. We define our test prompts as all possible sequences of length 8 (prepended with SOS), which we split into in-distribution and OOD test prompts based on whether they can be completed in the form of a^n b^n. The training set includes all unique sequences up to length 256 (see the data-generation sketch after the table). |
| Dataset Splits | No | We monitor training and validation loss, and the adherence to the grammar's two rules (R1), (R2). ... Table 4. Comparison of the extrapolation performance of MLE, adversarial, and oracle training for OOD prompts. For (approximately) the same validation loss, the extrapolation of (R1) for OOD prompts differs enormously, showing that the loss alone cannot distinguish the extrapolation property. |
| Hardware Specification | No | This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG. |
| Software Dependencies | Yes | We use PyTorch (Paszke et al., 2019), PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019), and Hugging Face Transformers (Wolf et al., 2020). |
| Experiment Setup | Yes | Table 3. Transformer parameters: number of layers 5, dropout probability 0.1, model dimension 10, feedforward dimension 1024, number of attention heads 5, learning rate 2e-3, batch size 128, number of epochs 50,000 (instantiated in the configuration sketch after the table). |
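
The a^n b^n data setup quoted in the Open Datasets row is simple enough to reconstruct. The sketch below is our reading, not the authors' released code (which is in the linked repository): the token IDs (a=0, b=1, SOS=2, EOS=3, PAD=4), the length-8 prompt split, and the completability criterion follow the quoted text, while the function names and enumeration strategy are our own.

```python
# Minimal sketch of the a^n b^n data setup described in the Open Datasets row.
# Token IDs follow the quoted text: a=0, b=1, SOS=2, EOS=3, PAD=4.
from itertools import product

A, B, SOS, EOS, PAD = 0, 1, 2, 3, 4

def training_sequences(max_len=256):
    """All unique a^n b^n strings up to max_len tokens, wrapped in SOS/EOS."""
    return [[SOS] + [A] * n + [B] * n + [EOS] for n in range(1, max_len // 2 + 1)]

def is_ood_prompt(prompt):
    """A prompt is OOD if it cannot be completed to the form a^n b^n:
    an a appears after a b, or the b's already outnumber the a's."""
    seen_b = False
    n_a = n_b = 0
    for tok in prompt:
        if tok == A:
            if seen_b:
                return True          # an a after a b can never be repaired
            n_a += 1
        else:
            seen_b = True
            n_b += 1
    return n_b > n_a                 # excess b's are also unrecoverable

# All 2^8 binary prompts of length 8, prepended with SOS, split by completability.
prompts = [list(p) for p in product([A, B], repeat=8)]
id_prompts  = [[SOS] + p for p in prompts if not is_ood_prompt(p)]
ood_prompts = [[SOS] + p for p in prompts if is_ood_prompt(p)]
```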
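The rule-extrapolation metric from the Research Type row can be sketched similarly. The quoted snippets do not spell out the two rules, so treating R1 as "every a precedes every b" and R2 as "equal counts of a and b" is our assumption, as is the greedy-decoding model interface; the token constants mirror the sketch above.

```python
import torch

A, B, SOS, EOS, PAD = 0, 1, 2, 3, 4  # token IDs as in the sketch above

def satisfies_r1(seq):
    """Assumed reading of (R1): all a's precede all b's."""
    seen_b = False
    for tok in seq:
        if tok == B:
            seen_b = True
        elif tok == A and seen_b:
            return False
    return True

def satisfies_r2(seq):
    """Assumed reading of (R2): equal numbers of a's and b's
    (the paper monitors adherence to both rules)."""
    return sum(t == A for t in seq) == sum(t == B for t in seq)

@torch.no_grad()
def rule_extrapolation(model, ood_prompts, max_new_tokens=256):
    """Fraction of OOD prompts whose greedy completion is consistent with R1."""
    hits = 0
    for prompt in ood_prompts:
        seq = list(prompt)
        for _ in range(max_new_tokens):
            logits = model(torch.tensor([seq]))   # assumed shape: (1, len, vocab)
            nxt = logits[0, -1].argmax().item()   # greedy decoding
            if nxt == EOS:
                break
            seq.append(nxt)
        hits += satisfies_r1(seq[1:])             # drop the SOS token
    return hits / len(ood_prompts)
```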
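Finally, the Table 3 hyperparameters can be instantiated roughly as follows. This is a minimal sketch that assumes PyTorch's built-in encoder layers with a causal mask stand in for the authors' decoder-only model; the real architecture is in the linked repository. Note that model dimension 10 with 5 attention heads implies a head dimension of 2.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 5   # a, b, SOS, EOS, PAD

class DecoderOnlyLM(nn.Module):
    """Causal Transformer LM with the Table 3 hyperparameters."""
    def __init__(self, d_model=10, n_heads=5, n_layers=5,
                 d_ff=1024, dropout=0.1, max_len=258):  # 256 tokens + SOS + EOS
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, x):
        # An upper-triangular -inf mask makes the encoder stack causal,
        # i.e. it behaves as a decoder-only model.
        t = x.size(1)
        mask = torch.triu(torch.full((t, t), float('-inf')), diagonal=1)
        h = self.embed(x) + self.pos(torch.arange(t, device=x.device))
        return self.head(self.blocks(h, mask=mask))

model = DecoderOnlyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)  # learning rate per Table 3
# Training loop (not shown): batch size 128, 50,000 epochs per Table 3.
```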