The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

Authors: Ezra Edelman, Nikolaos Tsilivis, Benjamin Edelman, Eran Malach, Surbhi Goel

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution.
Researcher Affiliation | Academia | Ezra Edelman (University of Pennsylvania, ezrae@cis.upenn.edu); Nikolaos Tsilivis (New York University, nt2231@nyu.edu); Benjamin L. Edelman (Harvard University, bedelman@g.harvard.edu); Eran Malach (Harvard University, emalach@g.harvard.edu); Surbhi Goel (University of Pennsylvania, surbhig@cis.upenn.edu)
Pseudocode | No | The paper describes the model architecture and its mathematical formulation but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/EzraEdelman/Evolution-of-Statistical-Induction-Heads
Open Datasets | No | Our learning task consists of sequences generated from Markov Chains with random transition matrices. ... The data was generated in an online fashion, using numpy.random.dirichlet to generate each row of the transition matrices. (See the sampling sketch after this table.)
Dataset Splits | No | The paper describes online data generation and reports test loss, but does not specify explicit training/validation/test splits or percentages.
Hardware Specification | Yes | All of the experiments were performed with a single NVIDIA GeForce GTX 1650 Ti GPU with 4 gigabytes of VRAM and 32 gigabytes of system memory.
Software Dependencies | Yes | We use PyTorch 2.1.2.
Experiment Setup | Yes | We train transformers of the form (1) with the AdamW optimizer with learning rate 3 × 10−5 (for 3-grams a learning rate of 3 × 10−2 was used), batch size 64, and hidden dimension 16. The sequence length of the examples is 100 tokens. The minimal model was trained with SGD, with batch size 64, and learning rate 3 × 10−4. (See the training-loop sketch after this table.)
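As a rough illustration of the online data generation quoted in the Open Datasets row, the sketch below draws each row of a transition matrix with numpy's Dirichlet sampler and then samples a sequence from the resulting chain. This is a minimal sketch, not the authors' code: the symmetric Dirichlet concentration of 1.0, the uniform initial state, and the number of states are assumptions.

```python
import numpy as np

def sample_markov_sequence(num_states, seq_len, rng):
    """Sample one sequence from a Markov chain with a random transition matrix.

    Each row of the transition matrix is drawn from a Dirichlet distribution,
    mirroring the paper's stated use of numpy.random.dirichlet; the symmetric
    concentration of 1.0 (uniform over the simplex) is an assumption.
    """
    transition = np.stack(
        [rng.dirichlet(np.ones(num_states)) for _ in range(num_states)]
    )
    states = np.empty(seq_len, dtype=np.int64)
    states[0] = rng.integers(num_states)  # assumed uniform initial state
    for t in range(1, seq_len):
        states[t] = rng.choice(num_states, p=transition[states[t - 1]])
    return states

rng = np.random.default_rng(0)
seq = sample_markov_sequence(num_states=3, seq_len=100, rng=rng)  # 100-token sequences, as in the setup row
```

Because a fresh transition matrix is drawn inside each call, every training sequence comes from its own random chain, which is what makes online generation possible without a fixed dataset or splits.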
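The Experiment Setup row lists hyperparameters but not how they fit together. The sketch below wires them into an AdamW training loop over online Markov-chain data, reusing the sampler above. The TinyCausalLM stand-in, the vocabulary size, and the number of training steps are assumptions; this is not the two-layer attention-only architecture "(1)" from the paper, only a runnable placeholder with the quoted hyperparameters.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row.
VOCAB_SIZE = 3      # assumption: number of Markov-chain states
HIDDEN_DIM = 16
BATCH_SIZE = 64
SEQ_LEN = 100
LR = 3e-5           # 3e-2 was used for the 3-gram experiments
NUM_STEPS = 1000    # assumption: total steps are not specified in this table

class TinyCausalLM(nn.Module):
    """Stand-in causal language model; NOT the paper's architecture."""
    def __init__(self, vocab_size, hidden_dim, max_len):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden_dim)
        self.pos = nn.Embedding(max_len, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=1, dim_feedforward=4 * hidden_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        t = x.shape[1]
        h = self.tok(x) + self.pos(torch.arange(t, device=x.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        return self.head(self.encoder(h, mask=mask))

model = TinyCausalLM(VOCAB_SIZE, HIDDEN_DIM, SEQ_LEN)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
loss_fn = nn.CrossEntropyLoss()

for step in range(NUM_STEPS):
    # Online data: a fresh chain and sequence per example (see the sampler above).
    seqs = torch.stack([
        torch.from_numpy(sample_markov_sequence(VOCAB_SIZE, SEQ_LEN + 1, rng))
        for _ in range(BATCH_SIZE)
    ])
    inputs, targets = seqs[:, :-1], seqs[:, 1:]          # next-token prediction
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Swapping AdamW for torch.optim.SGD with learning rate 3 × 10−4 would correspond to the minimal-model setting quoted in the same row.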