Grokking Group Multiplication with Cosets
Authors: Dashiell Stander, Qinan Yu, Honglu Fan, Stella Biderman
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Building on previous work, we completely reverse engineer fully connected one-hidden-layer networks that have grokked the arithmetic of the permutation groups S5 and S6. ... We apply a methodology inspired by Geiger et al. [17] to use causal experiments to thoroughly test all of the properties of our proposed circuit. ... Table 1. Causal interventions aggregated over 128 runs on S5 with different sizes (an illustrative intervention sketch appears after this table) |
| Researcher Affiliation | Collaboration | 1EleutherAI, 2Brown University, 3University of Geneva. |
| Pseudocode | No | The paper describes methods and processes in text but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | All code necessary for reproducing results and analysis is available at https://www.github.com/dashstander/sn-grok |
| Open Datasets | No | The paper studies models that learn to multiply permutations of S5 and S6, which are mathematical groups; the data is the group multiplication table, generated procedurally. It does not provide a link or specific citation to a pre-existing, publicly available dataset in the conventional sense (e.g., a downloadable file or a dataset repository). A minimal generation sketch appears after this table. |
| Dataset Splits | Yes | The models we study exhibit grokking, wherein the model first memorizes the training set and then much later generalizes to the held-out data perfectly. ... As the validation loss approaches a small value... ... Table 2. Experiment hyperparameters. Group % Train Set: 40% (the dataset sketch after this table illustrates this split) |
| Hardware Specification | Yes | All models were trained on NVIDIA GeForce RTX 2080 GPUs. |
| Software Dependencies | Yes | All models were implemented in PyTorch (Paszke et al. [45]) ... Analysis and reverse engineering were performed with Vink et al. [56], Nanda & Bloom [39], Harris et al. [22], GAP [16], Stein et al. [53]. (References cite specific versions: Polars 0.19.0, GAP 4.12.2, Sage Mathematics Software 10.0.0.) |
| Experiment Setup | Yes | All models were implemented in PyTorch (Paszke et al. [45]) and trained with the Adam optimizer [28] with a fixed learning rate of 0.001, weight decay set to 1.0, β1 = 0.9, and β2 = 0.98. ... Table 2. Experiment hyperparameters (columns: Group, % Train Set, Num. Runs, Num. Epochs, Linear Layer Size, Embedding Size). A training-setup sketch follows this table. |
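
The "dataset" for these experiments is the multiplication table of the group itself, so it can be regenerated from scratch rather than downloaded. Below is a minimal sketch, not the authors' code: the function name, seed, and integer label encoding are assumptions for illustration. It enumerates all ordered pairs of S5 elements, labels each pair with its product, and takes the 40% train split quoted in Table 2.

```python
import itertools
import random

def make_sn_dataset(n=5, train_frac=0.4, seed=0):
    """Enumerate the multiplication table of S_n and split it for training."""
    perms = list(itertools.permutations(range(n)))   # all n! group elements
    index = {p: i for i, p in enumerate(perms)}      # element -> integer label

    def compose(p, q):
        # (p * q)(i) = p(q(i)): apply q first, then p
        return tuple(p[q[i]] for i in range(n))

    # One example per ordered pair (p, q); the target is their product.
    examples = [(index[p], index[q], index[compose(p, q)])
                for p, q in itertools.product(perms, repeat=2)]

    random.Random(seed).shuffle(examples)
    cut = int(train_frac * len(examples))
    return examples[:cut], examples[cut:]

train, heldout = make_sn_dataset()
print(len(train), len(heldout))  # 5760 train / 8640 held-out pairs for S5
```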
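A minimal sketch of the quoted training configuration, assuming a standard embed → one-hidden-layer → unembed architecture. The concatenation of the two input embeddings, the ReLU, and the layer sizes below are illustrative assumptions; Table 2 in the paper lists the actual Linear Layer Size and Embedding Size per group. The optimizer settings are the ones quoted above.

```python
import torch
import torch.nn as nn

n_elements = 120               # |S5|
d_embed, d_hidden = 256, 128   # placeholder sizes; see Table 2 for actual values

class OneHiddenLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_elements, d_embed)
        self.hidden = nn.Linear(2 * d_embed, d_hidden)
        self.unembed = nn.Linear(d_hidden, n_elements)

    def forward(self, left, right):
        # Concatenate the two element embeddings, apply the single hidden
        # layer, and read out logits over all group elements.
        x = torch.cat([self.embed(left), self.embed(right)], dim=-1)
        return self.unembed(torch.relu(self.hidden(x)))

model = OneHiddenLayerNet()
# Quoted hyperparameters: lr 0.001, weight decay 1.0, β1 = 0.9, β2 = 0.98.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.98), weight_decay=1.0)
```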
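The causal experiments referenced above follow the interchange-intervention methodology of Geiger et al. [17]. The following is a minimal sketch of that general pattern (an assumption about the mechanics, not the authors' pipeline): cache the hidden activation from a "source" input, patch it into a run on a "base" input, and check whether the output moves as the proposed circuit predicts. It reuses the `OneHiddenLayerNet` sketch above.

```python
import torch

@torch.no_grad()
def interchange_intervention(model, base, source):
    """Run `base` through the model with the hidden activation from `source`."""
    cache = {}

    # 1) Source run: cache the hidden (pre-ReLU) activation via a forward hook.
    handle = model.hidden.register_forward_hook(
        lambda mod, inp, out: cache.update(hidden=out))
    model(*source)
    handle.remove()

    # 2) Base run: a forward hook that returns a value replaces the module's
    #    output, so the cached source activation is patched in.
    handle = model.hidden.register_forward_hook(
        lambda mod, inp, out: cache["hidden"])
    patched_logits = model(*base)
    handle.remove()
    return patched_logits

# Example: patch the hidden state of one (left, right) pair into another.
base = (torch.tensor([3]), torch.tensor([7]))
source = (torch.tensor([50]), torch.tensor([7]))
logits = interchange_intervention(model, base, source)
```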