Grokking Group Multiplication with Cosets

Authors: Dashiell Stander, Qinan Yu, Honglu Fan, Stella Biderman

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Building on previous work, we completely reverse engineer fully connected one-hidden layer networks that have grokked the arithmetic of the permutation groups S5 and S6. ... We apply a methodology inspired by Geiger et al. [17] to use causal experiments to thoroughly test all of the properties of our proposed circuit. ... Table 1. Causal interventions aggregated over 128 runs on S5 with different sizes
Researcher Affiliation | Collaboration | EleutherAI, Brown University, and University of Geneva.
Pseudocode | No | The paper describes methods and processes in text but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | All code necessary for reproducing results and analysis is available at https://www.github.com/dashstander/sn-grok
Open Datasets | No | The paper studies models that learn to multiply permutations of S5 and S6, which are mathematical groups, so the data is generated from the groups themselves. It does not provide a link or specific citation to a pre-existing, publicly available dataset in the conventional sense (e.g., a downloadable file or a dataset repository). (See the dataset sketch below the table.)
Dataset Splits | Yes | The models we study exhibit grokking, wherein the model first memorizes the training set and then much later generalizes to the held-out data perfectly. ... As the validation loss approaches a small value... ... Table 2. Experiment hyperparameters: % Train Set = 40%. (See the split sketch below the table.)
Hardware Specification | Yes | All models were trained on NVIDIA GeForce RTX 2080 GPUs.
Software Dependencies | Yes | All models were implemented in PyTorch (Paszke et al. [45]). ... Analysis and reverse engineering was performed with Vink et al. [56], Nanda & Bloom [39], Harris et al. [22], GAP [16], Stein et al. [53]. (The references cite specific versions: Python Polars 0.19.0, GAP 4.12.2, Sage Mathematics Software 10.0.0.)
Experiment Setup | Yes | All models were implemented in PyTorch (Paszke et al. [45]) and trained with the Adam optimizer [28] with a fixed learning rate of 0.001, weight decay set to 1.0, β1 = 0.9, and β2 = 0.98. ... Table 2. Experiment hyperparameters (columns: Group, % Train Set, Num. Runs, Num. Epochs, Linear Layer Size, Embedding Size). (See the training-setup sketch below the table.)
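The Open Datasets row notes that the data is simply the multiplication table of S5 (or S6) rather than a downloadable dataset. Below is a minimal sketch of how such a dataset could be built; the element indexing, the composition convention, and the function name build_s5_dataset are illustrative assumptions, and the actual construction lives in the sn-grok repository.

```python
# Sketch: build the S5 group-multiplication dataset as (left_id, right_id, product_id)
# triples. Indexing and composition convention are assumptions; the paper's repo
# (github.com/dashstander/sn-grok) may do this differently.
from itertools import permutations

def build_s5_dataset():
    # Enumerate all 120 permutations of {0, ..., 4} and assign each an integer id.
    elements = list(permutations(range(5)))
    index = {p: i for i, p in enumerate(elements)}

    def compose(p, q):
        # (p * q)(x) = p(q(x)): apply q first, then p.
        return tuple(p[q[x]] for x in range(5))

    # One example per ordered pair (a, b), labelled with the id of a * b.
    return [(index[a], index[b], index[compose(a, b)])
            for a in elements for b in elements]

dataset = build_s5_dataset()   # 120 * 120 = 14,400 triples
```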
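For the Dataset Splits row, Table 2 reports a 40% training split with the rest held out until the model groks. A minimal split sketch, assuming a uniform random shuffle; the split_dataset helper and the seed handling are illustrative, not the paper's exact code.

```python
# Sketch: random 40% / 60% train / held-out split of the multiplication table.
import random

def split_dataset(data, train_frac=0.4, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]                    # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]  # (train set, held-out set)

# Usage, with `dataset` from the previous sketch:
train_set, val_set = split_dataset(dataset)
```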
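For the Experiment Setup row, the quoted optimizer settings map directly onto torch.optim.Adam. The sketch below pairs them with a generic one-hidden-layer network using separate left/right embeddings; the class name and the concrete layer sizes are placeholders standing in for the "Linear Layer Size" and "Embedding Size" columns of Table 2, and the exact architecture should be taken from the sn-grok code.

```python
# Sketch: one-hidden-layer network plus the quoted Adam settings
# (lr 0.001, weight decay 1.0, betas (0.9, 0.98)). Sizes are placeholders.
import torch
import torch.nn as nn

GROUP_ORDER = 120   # |S5|
EMBED_DIM = 256     # "Embedding Size" in Table 2 (placeholder value)
HIDDEN_DIM = 128    # "Linear Layer Size" in Table 2 (placeholder value)

class OneHiddenLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed_left = nn.Embedding(GROUP_ORDER, EMBED_DIM)
        self.embed_right = nn.Embedding(GROUP_ORDER, EMBED_DIM)
        self.hidden = nn.Linear(2 * EMBED_DIM, HIDDEN_DIM)
        self.unembed = nn.Linear(HIDDEN_DIM, GROUP_ORDER)

    def forward(self, left_idx, right_idx):
        # Concatenate the two embeddings, apply one hidden layer, then unembed
        # to logits over the 120 possible products.
        x = torch.cat([self.embed_left(left_idx), self.embed_right(right_idx)], dim=-1)
        return self.unembed(torch.relu(self.hidden(x)))

model = OneHiddenLayerNet()
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, weight_decay=1.0, betas=(0.9, 0.98)
)

# Example forward pass on a single (left, right) pair of element ids.
logits = model(torch.tensor([0]), torch.tensor([1]))
```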