The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

Authors: Ezra Edelman, Nikolaos Tsilivis, Benjamin Edelman, Eran Malach, Surbhi Goel

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution.
Researcher Affiliation | Academia | Ezra Edelman (University of Pennsylvania, ezrae@cis.upenn.edu); Nikolaos Tsilivis (New York University, nt2231@nyu.edu); Benjamin L. Edelman (Harvard University, bedelman@g.harvard.edu); Eran Malach (Harvard University, emalach@g.harvard.edu); Surbhi Goel (University of Pennsylvania, surbhig@cis.upenn.edu)
Pseudocode | No | The paper describes the model architecture and its mathematical formulation but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/EzraEdelman/Evolution-of-Statistical-Induction-Heads
Open Datasets | No | Our learning task consists of sequences generated from Markov Chains with random transition matrices. ... The data was generated in an online fashion, using numpy.random.dirichlet to generate each row of the transition matrices. (See the sampling sketch after this table.)
Dataset Splits | No | The paper describes online data generation and reports test loss, but does not specify explicit training/validation/test splits or percentages.
Hardware Specification | Yes | All of the experiments were performed with a single NVIDIA GeForce GTX 1650 Ti GPU with 4 gigabytes of VRAM and 32 gigabytes of system memory.
Software Dependencies | Yes | We use PyTorch 2.1.2.
Experiment Setup | Yes | We train transformers of the form (1) with the AdamW optimizer with learning rate 3 × 10−5 (for 3-grams a learning rate of 3 × 10−2 was used), batch size 64, and hidden dimension 16. The sequence length of the examples is 100 tokens. The minimal model was trained with SGD, with batch size 64, and learning rate 3 × 10−4. (See the training-loop sketch after this table.)
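As a rough illustration of the online data generation quoted in the Open Datasets row, the sketch below draws each row of a transition matrix with numpy's Dirichlet sampler and then samples a sequence from the resulting chain. This is a minimal sketch, not the authors' code: the symmetric Dirichlet concentration of 1.0, the uniform initial state, and the number of states are assumptions.

```python
import numpy as np

def sample_markov_sequence(num_states, seq_len, rng):
    """Sample one sequence from a Markov chain with a random transition matrix.

    Each row of the transition matrix is drawn from a Dirichlet distribution,
    mirroring the paper's stated use of numpy.random.dirichlet; the symmetric
    concentration of 1.0 (uniform over the simplex) is an assumption.
    """
    transition = np.stack(
        [rng.dirichlet(np.ones(num_states)) for _ in range(num_states)]
    )
    states = np.empty(seq_len, dtype=np.int64)
    states[0] = rng.integers(num_states)  # assumed uniform initial state
    for t in range(1, seq_len):
        states[t] = rng.choice(num_states, p=transition[states[t - 1]])
    return states

rng = np.random.default_rng(0)
seq = sample_markov_sequence(num_states=3, seq_len=100, rng=rng)  # 100-token sequences, as in the setup row
```

Because a fresh transition matrix is drawn inside each call, every training sequence comes from its own random chain, which is what makes online generation possible without a fixed dataset or splits.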
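The Experiment Setup row lists hyperparameters but not how they fit together. The sketch below wires them into an AdamW training loop over online Markov-chain data, reusing the sampler above. The TinyCausalLM stand-in, the vocabulary size, and the number of training steps are assumptions; this is not the two-layer attention-only architecture "(1)" from the paper, only a runnable placeholder with the quoted hyperparameters.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row.
VOCAB_SIZE = 3      # assumption: number of Markov-chain states
HIDDEN_DIM = 16
BATCH_SIZE = 64
SEQ_LEN = 100
LR = 3e-5           # 3e-2 was used for the 3-gram experiments
NUM_STEPS = 1000    # assumption: total steps are not specified in this table

class TinyCausalLM(nn.Module):
    """Stand-in causal language model; NOT the paper's architecture."""
    def __init__(self, vocab_size, hidden_dim, max_len):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden_dim)
        self.pos = nn.Embedding(max_len, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=1, dim_feedforward=4 * hidden_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        t = x.shape[1]
        h = self.tok(x) + self.pos(torch.arange(t, device=x.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        return self.head(self.encoder(h, mask=mask))

model = TinyCausalLM(VOCAB_SIZE, HIDDEN_DIM, SEQ_LEN)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
loss_fn = nn.CrossEntropyLoss()

for step in range(NUM_STEPS):
    # Online data: a fresh chain and sequence per example (see the sampler above).
    seqs = torch.stack([
        torch.from_numpy(sample_markov_sequence(VOCAB_SIZE, SEQ_LEN + 1, rng))
        for _ in range(BATCH_SIZE)
    ])
    inputs, targets = seqs[:, :-1], seqs[:, 1:]          # next-token prediction
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Swapping AdamW for torch.optim.SGD with learning rate 3 × 10−4 would correspond to the minimal-model setting quoted in the same row.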